You have some choices:

a) Use something that you are familiar with, on the theory that you will have a better chance of getting the implementation right. You may prefer n^2 to the effort involved in what I propose.

b) Read technical papers and use the latest and greatest technique. If you can get yourself to the MIT library, or somewhere with free access to the IEEE and ACM web portals, you can scan recent conferences for relevant work. Keywords you should try include "nearest neighbor" and "fuzzy match". I recommend reading about VP-trees and MVP-trees (author Ozsoyoglu), which implement similarity search in amazing runtimes. You can also look up Sergey Brin's work on what he called GNATs. You may recognize Brin as one of the brains behind Google. SIGMOD and VLDB (Very Large Data Bases) seem to attract the best research relevant to your question.

Warning: those papers are dense and full of jargon. One I read recently (and reread) that you may enjoy is:

  Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti (Microsoft Research),
  and Rajeev Motwani (Stanford University).
  "Robust and efficient fuzzy match for online data cleaning."
  Proceedings of the 2003 ACM SIGMOD International Conference on
  Management of Data, San Diego, California. Session: Similarity
  queries I, pages 313-324, 2003. ISBN 1-58113-634-X.

In case you haven't read these kinds of papers: by "online" they mean that their algorithm works while the data is not yet completely received, which lets you avoid having your CPU sit idle waiting for the rest of the data.

This guy: http://www.cs.helsinki.fi/u/ukkonen/ has also done work relevant to what you're looking for. I specifically recommend:

  E. Ukkonen: "Constructing suffix trees on-line in linear time."
  Proc. Information Processing 92, Vol. 1, IFIP Transactions A-12,
  pages 484-492, Elsevier, 1992.

That paper borders on magic.
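As a quick illustration of the "fuzzy match" idea for the reordered-title case: comparing the *sets* of word tokens ignores word order entirely, so the two Tom Lehrer strings come out identical. This is a minimal sketch, not from any of the papers above; jaccard_similarity is an illustrative helper name:

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the word-token sets of two strings (0.0 to 1.0)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty strings are trivially identical
    # size of intersection over size of union
    return len(ta & tb) / len(ta | tb)

s1 = "Tom Lehrer - That Was The Year That Was"
s2 = "That Was The Year That Was - Tom Lehrer"
print(jaccard_similarity(s1, s2))  # 1.0 -- same tokens, different order
```

It runs in roughly linear time in the combined string lengths, though of course it is blind to word order and repetition, which matter for some applications.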
Finally, if you don't have a copy of Cormen, Leiserson, Rivest, and Stein's Introduction to Algorithms, you should. You may recognize the name Rivest from RSA and MD5. Chapter 15 talks about longest common subsequence. Any decent graduate-level algorithms course puts that book on the syllabus.

--
David Backeberg (dave at math.mit.edu)

On Thu, 18 May 2006, markw at mohawksoft.com wrote:

> While I like to think I'm no slouch in this department, I'd like to see if
> anyone has a better idea:
>
> You have two strings and you want to quantify how similar they are.
>
> Levenshtein distance is not a good choice, for two reasons:
>
> (1) It doesn't really show how similar strings are. For instance:
>
> "Tom Lehrer - That Was The Year That Was" and "That Was The Year That Was -
> Tom Lehrer"
>
> are, to you and me, almost identical, but that is difficult to detect;
> Levenshtein only shows how many steps are required to convert one string
> into the other.
>
> (2) Levenshtein distance is an N^2 algorithm and, well, yuck.
>
> Anyone have any ideas? I'm currently using an n^2 algorithm that scans
> through both strings and uses run-length matching, but if you know of
> something better, I'd love to hear it.
>
> _______________________________________________
> Discuss mailing list
> Discuss at blu.org
> http://olduvai.blu.org/mailman/listinfo/discuss
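For reference, the longest-common-subsequence computation mentioned above (the CLRS chapter 15 material) is the textbook O(m*n) dynamic program; a minimal sketch, not tied to any particular book's pseudocode:

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (classic DP)."""
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1  # extend a common subsequence
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])  # drop a char from one side
    return dp[m][n]

print(lcs_length("AGGTAB", "GXTXAYB"))  # 4 -- the subsequence "GTAB"
```

Note that like Levenshtein this is quadratic, but the length of the LCS (perhaps normalized by the longer string's length) is one reasonable way to quantify how similar two strings are.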