Home
| Calendar
| Mail Lists
| List Archives
| Desktop SIG
| Hardware Hacking SIG
Wiki | Flickr | PicasaWeb | Video | Maps & Directions | Installfests | Keysignings Linux Cafe | Meeting Notes | Linux Links | Bling | About BLU |
> On Thu, May 18, 2006 at 09:06:01AM -0400, markw at mohawksoft.com wrote: >> While I like to think I'm no slouch in this department, I'd like to see >> if >> anyone has a better idea: >> >> You have two strings and you want to quantify how similar they are. >> >> "Tom Lehrer - The Was The Year That Was" and "That Was The Year That Was >> - >> Tom Lehrer" >> >> (2) Levenshtein Distance is an N^2 algorithm and, well, yuck. >> >> Anyone have any ideas? I currently using an n^2 algorithm that scans >> through both strings and used run length matching, but if you know of >> something better, I'd love to hear it. > > I don't have anything faster, but you could, for example, apply > SOUNDEX to each word, sort the result, and count how many were > the same. > > You'll probably want an enhanced SOUNDEX... see > http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm As for soundex, I prefer metaphone over soundex. Conceptually, soundex is similar to a hash value, i.e. a pseudo-unique value best used for indexing where as metaphone breaks up the word into phoneitic components. Words parsed and sorted can tell you how many similar words there are, but can't really detail similarity instructure, there also needs to be a component that details how many words in a row match. Then there is matching "like" and "likely." Think about this: I have two database records: 'That was the year that was" and "EMI: That Was the year that was - Tom Lehrer" I should be able to look at that and say, with some level of confidence, that they are very close. Words repeat, and "that" and "was" are hardly significant in their comonality, but "that was the year that was" in that order is the real meat of the string.
BLU is a member of BostonUserGroups | |
We also thank MIT for the use of their facilities. |