Any computer science geeks?
David Backeberg
dave at math.mit.edu
Thu Jun 8 14:37:58 EDT 2006
You have some choices:
a) use something that you are familiar with, on the idea that you will have
a better chance of getting the implementation right. You may prefer n^2 to
the effort involved in what I propose.
b) read technical papers and use the latest and greatest technique.
If you can get yourself to the MIT library or somewhere with free
access to the IEEE and ACM web portals, you can scan recent conferences for
relevant work. Keywords you should try include "nearest neighbor" and
"fuzzy match".
I recommend reading about VP-trees and MVP-trees (author Ozsoyoglu), which
implement similarity search in amazing runtimes. You can also look up
Sergey Brin's work on what he called GNATs. You may recognize Brin as one
of the brains behind Google.
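To make the VP-tree idea concrete: each node picks a "vantage point," splits the remaining items by distance to it at the median, and the triangle inequality lets you prune whole subtrees at query time. Here's a minimal sketch in Python (the class and names are my own, not taken from the Ozsoyoglu papers), using edit distance as the metric:

```python
def edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

class VPTree:
    """Vantage-point tree: each node splits points by distance to a vantage point."""
    def __init__(self, points, dist):
        self.dist = dist
        self.vp = points[0]
        rest = points[1:]
        if rest:
            ds = sorted((dist(self.vp, p), p) for p in rest)
            self.mu = ds[len(ds) // 2][0]        # median distance = split radius
            inner = [p for d, p in ds if d < self.mu]
            outer = [p for d, p in ds if d >= self.mu]
            self.inner = VPTree(inner, dist) if inner else None
            self.outer = VPTree(outer, dist) if outer else None
        else:
            self.mu, self.inner, self.outer = None, None, None

    def within(self, query, radius, out=None):
        """All stored points within `radius` of `query`, pruned via triangle inequality."""
        if out is None:
            out = []
        d = self.dist(self.vp, query)
        if d <= radius:
            out.append(self.vp)
        if self.mu is not None:
            if d - radius < self.mu and self.inner:
                self.inner.within(query, radius, out)
            if d + radius >= self.mu and self.outer:
                self.outer.within(query, radius, out)
        return out

tree = VPTree(["cat", "cart", "dog", "dodge", "care"], edit_distance)
print(sorted(tree.within("cars", 2)))   # -> ['care', 'cart', 'cat']
```

The point of the structure is that a query near the vantage point never has to descend into the "far" half (and vice versa), which is where the sub-linear search times come from.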
SIGMOD and VLDB (Very Large Databases) seem to attract the best research
that would be relevant to your question. Warning, those papers are dense
and full of jargon. One I read recently (and reread) you may enjoy is:
Robust and efficient fuzzy match for online data cleaning
Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti (Microsoft Research),
Rajeev Motwani (Stanford University)
Proceedings of the 2003 ACM SIGMOD International Conference on Management
of Data, San Diego, California, pp. 313-324. ISBN 1-58113-634-X.
In case you haven't read these kinds of papers, by "online" they mean that
the algorithm does its work before the data has been completely received,
so your CPU isn't sitting idle waiting for the rest of the input.
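You can see that "online" flavor with edit distance itself: the DP table extends one column per arriving character, so nothing waits for the full input. A minimal sketch (my own illustration, not the algorithm from that paper):

```python
class OnlineEditDistance:
    """Maintains the current DP column of Levenshtein distance against a fixed
    pattern; feed() consumes one streamed character at a time."""
    def __init__(self, pattern):
        self.pattern = pattern
        self.col = list(range(len(pattern) + 1))  # distances vs. empty stream
        self.n = 0                                # characters consumed so far

    def feed(self, ch):
        self.n += 1
        new = [self.n]
        for i, pc in enumerate(self.pattern, 1):
            new.append(min(self.col[i] + 1,              # skip the stream char
                           new[i - 1] + 1,               # skip a pattern char
                           self.col[i - 1] + (pc != ch)))  # match / substitute
        self.col = new
        return self.col[-1]   # distance between pattern and stream seen so far

ed = OnlineEditDistance("kitten")
for c in "sitting":
    d = ed.feed(c)
print(d)   # -> 3 once the whole stream has arrived
```

Each feed() is O(len(pattern)), so the total work is the same as the batch algorithm; the win is that you always have the answer for the prefix received so far.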
This guy:
http://www.cs.helsinki.fi/u/ukkonen/
has also done work relevant to what you're looking for. I specifically
recommend:
E. Ukkonen: Constructing suffix trees on-line in linear time. Proc.
Information Processing 92, Vol. 1, IFIP Transactions A-12, 484-492,
Elsevier 1992.
That paper borders on magic.
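Ukkonen's linear-time construction is too intricate to sketch in an email, but a naive suffix trie (quadratic build time and space, so strictly a toy next to the real thing) shows what the structure buys you: after a one-time build, substring queries cost only the length of the pattern. The names here are my own:

```python
def build_suffix_trie(text):
    """Naive suffix trie: insert every suffix character by character.
    O(n^2) build, unlike Ukkonen's O(n) suffix *tree* construction."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})  # descend, creating nodes as needed
    return root

def contains(trie, pattern):
    """True iff `pattern` occurs anywhere in the indexed text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("banana")
print(contains(trie, "nan"), contains(trie, "nab"))   # -> True False
```

The magic in Ukkonen's paper is getting an equivalent (edge-compressed) structure built in linear time, online, as the text streams in.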
Finally, if you don't have a copy of Cormen, Leiserson, Rivest, and
Stein's Introduction to Algorithms, you should. You may recognize the name
Rivest from RSA and MD5. Chapter 15 covers longest common subsequence. Any
decent graduate-level algorithms course puts that book on the syllabus.
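For reference, the longest-common-subsequence recurrence from that chapter, sketched in Python (keeping only one DP row at a time):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (the DP from CLRS)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur.append(prev[j - 1] + 1)        # extend a common subsequence
            else:
                cur.append(max(prev[j], cur[j - 1]))  # drop a char from a or b
        prev = cur
    return prev[-1]

print(lcs_length("ABCBDAB", "BDCABA"))   # -> 4, the book's classic example
```

Note this is also an n^2 algorithm, but the LCS length (normalized by string length) often makes a better similarity score than raw edit distance.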
--
David Backeberg (dave at math.mit.edu)
On Thu, 18 May 2006, markw at mohawksoft.com wrote:
> While I like to think I'm no slouch in this department, I'd like to see if
> anyone has a better idea:
>
> You have two strings and you want to quantify how similar they are.
>
> Levenshtein Distance is not a good choice for two reasons:
>
> (1) it doesn't really show how similar strings are. For instance:
>
> "Tom Lehrer - That Was The Year That Was" and "That Was The Year That Was -
> Tom Lehrer"
>
> are, to you and me, almost identical, but that's difficult to detect;
> Levenshtein only shows how many steps are required to convert one to the other.
>
> (2) Levenshtein Distance is an N^2 algorithm and, well, yuck.
>
> Anyone have any ideas? I'm currently using an n^2 algorithm that scans
> through both strings and uses run-length matching, but if you know of
> something better, I'd love to hear it.
>
> _______________________________________________
> Discuss mailing list
> Discuss at blu.org
> http://olduvai.blu.org/mailman/listinfo/discuss
>
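For what it's worth, the reordering example in the question above ("Tom Lehrer" moved to the other end) is exactly where token-level measures beat character-level edit distance. A minimal sketch using Jaccard similarity over word sets (one option among many, not something from the papers above):

```python
def jaccard_tokens(s, t):
    """Similarity in [0, 1]: overlap of word sets, ignoring order and case."""
    a, b = set(s.lower().split()), set(t.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

x = "Tom Lehrer - That Was The Year That Was"
y = "That Was The Year That Was - Tom Lehrer"
print(jaccard_tokens(x, y))   # -> 1.0: same words, different order
```

It runs in roughly linear time and is immune to reordering, at the cost of ignoring word order entirely; hybrids that combine token overlap with per-token edit distance are what the SIGMOD fuzzy-match paper explores.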