Any computer science geeks?

markw at mohawksoft.com markw at mohawksoft.com
Thu May 18 10:08:02 EDT 2006


> On Thu, May 18, 2006 at 09:06:01AM -0400, markw at mohawksoft.com wrote:
>> While I like to think I'm no slouch in this department, I'd like to see
>> if
>> anyone has a better idea:
>>
>> You have two strings and you want to quantify how similar they are.
>>
>> "Tom Lehrer - The Was The Year That Was" and "That Was The Year That Was
>> -
>> Tom Lehrer"
>>
>> (2) Levenshtein Distance is an N^2 algorithm and, well, yuck.
>>
>> Anyone have any ideas? I currently using an n^2 algorithm that scans
>> through both strings and used run length matching, but if you know of
>> something better, I'd love to hear it.
>
> I don't have anything faster, but you could, for example, apply
> SOUNDEX to each word, sort the result, and count how many were
> the same.
>
> You'll probably want an enhanced SOUNDEX... see
> http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm

As for soundex, I prefer metaphone over soundex. Conceptually, soundex is
similar to a hash value, i.e. a pseudo-unique value best used for indexing
where as metaphone breaks up the word into phoneitic components.

Words parsed and sorted can tell you how many similar words there are, but
can't really detail similarity instructure, there also needs to be a
component that details how many words in a row match. Then there is
matching "like" and "likely."

Think about this:

I have two database records:
'That was the year that was"
and
"EMI: That Was the year that was - Tom Lehrer"

I should be able to look at that and say, with some level of confidence,
that they are very close. Words repeat, and "that" and "was" are hardly
significant in their comonality, but "that was the year that was" in that
order is the real meat of the string.






More information about the Discuss mailing list