Any computer science geeks?
David Backeberg
dave at math.mit.edu
Thu Jun 8 14:37:58 EDT 2006
You have some choices:
a) use something that you are familiar with, on the idea that you will have
a better chance of getting the implementation right. You may prefer n^2 to
the effort involved in what I propose.
b) read technical papers and use the latest and greatest technique.
If you can get yourself to the MIT library or somewhere with free
access to the IEEE and ACM web portals, you can scan recent conferences for
relevant work. Keywords you should try include "nearest neighbor" and
"fuzzy match".
I recommend reading about VP-trees and MVP-trees (author Ozsoyoglu), which
implement similarity search in amazing runtimes. You can also look up
Sergey Brin's work on what he called GNATs. You may recognize Brin as one
of the brains behind Google.
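To make the VP-tree idea concrete: each node picks a "vantage point," splits the remaining items by distance to it at the median, and the triangle inequality lets you prune whole subtrees at query time. Here's a minimal sketch in Python (the class and names are my own, not taken from the Ozsoyoglu papers), using edit distance as the metric:

```python
def edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

class VPTree:
    """Vantage-point tree: each node splits points by distance to a vantage point."""
    def __init__(self, points, dist):
        self.dist = dist
        self.vp = points[0]
        rest = points[1:]
        if rest:
            ds = sorted((dist(self.vp, p), p) for p in rest)
            self.mu = ds[len(ds) // 2][0]        # median distance = split radius
            inner = [p for d, p in ds if d < self.mu]
            outer = [p for d, p in ds if d >= self.mu]
            self.inner = VPTree(inner, dist) if inner else None
            self.outer = VPTree(outer, dist) if outer else None
        else:
            self.mu, self.inner, self.outer = None, None, None

    def within(self, query, radius, out=None):
        """All stored points within `radius` of `query`, pruned via triangle inequality."""
        if out is None:
            out = []
        d = self.dist(self.vp, query)
        if d <= radius:
            out.append(self.vp)
        if self.mu is not None:
            if d - radius < self.mu and self.inner:
                self.inner.within(query, radius, out)
            if d + radius >= self.mu and self.outer:
                self.outer.within(query, radius, out)
        return out

tree = VPTree(["cat", "cart", "dog", "dodge", "care"], edit_distance)
print(sorted(tree.within("cars", 2)))   # -> ['care', 'cart', 'cat']
```

The point of the structure is that a query near the vantage point never has to descend into the "far" half (and vice versa), which is where the sub-linear search times come from.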
SIGMOD and VLDB (Very Large Databases) seem to attract the best research
that would be relevant to your question. Warning, those papers are dense
and full of jargon. One I read recently (and reread) you may enjoy is:
Robust and efficient fuzzy match for online data cleaning
Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti (Microsoft Research),
Rajeev Motwani (Stanford University)
Proceedings of the 2003 ACM SIGMOD International Conference on Management
of Data, San Diego, California, pp. 313-324. ISBN 1-58113-634-X.
In case you haven't read these kinds of papers, by "online" they mean that
the algorithm does its work before the data has been completely received,
so your CPU isn't sitting idle waiting for the rest of the input.
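You can see that "online" flavor with edit distance itself: the DP table extends one column per arriving character, so nothing waits for the full input. A minimal sketch (my own illustration, not the algorithm from that paper):

```python
class OnlineEditDistance:
    """Maintains the current DP column of Levenshtein distance against a fixed
    pattern; feed() consumes one streamed character at a time."""
    def __init__(self, pattern):
        self.pattern = pattern
        self.col = list(range(len(pattern) + 1))  # distances vs. empty stream
        self.n = 0                                # characters consumed so far

    def feed(self, ch):
        self.n += 1
        new = [self.n]
        for i, pc in enumerate(self.pattern, 1):
            new.append(min(self.col[i] + 1,              # skip the stream char
                           new[i - 1] + 1,               # skip a pattern char
                           self.col[i - 1] + (pc != ch)))  # match / substitute
        self.col = new
        return self.col[-1]   # distance between pattern and stream seen so far

ed = OnlineEditDistance("kitten")
for c in "sitting":
    d = ed.feed(c)
print(d)   # -> 3 once the whole stream has arrived
```

Each feed() is O(len(pattern)), so the total work is the same as the batch algorithm; the win is that you always have the answer for the prefix received so far.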
This guy:
http://www.cs.helsinki.fi/u/ukkonen/
has also done work relevant to what you're looking for. I specifically
recommend:
E. Ukkonen: Constructing suffix trees on-line in linear time. Proc.
Information Processing 92, Vol. 1, IFIP Transactions A-12, 484-492,
Elsevier 1992.
That paper borders on magic.
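Ukkonen's linear-time construction is too intricate to sketch in an email, but a naive suffix trie (quadratic build time and space, so strictly a toy next to the real thing) shows what the structure buys you: after a one-time build, substring queries cost only the length of the pattern. The names here are my own:

```python
def build_suffix_trie(text):
    """Naive suffix trie: insert every suffix character by character.
    O(n^2) build, unlike Ukkonen's O(n) suffix *tree* construction."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})  # descend, creating nodes as needed
    return root

def contains(trie, pattern):
    """True iff `pattern` occurs anywhere in the indexed text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("banana")
print(contains(trie, "nan"), contains(trie, "nab"))   # -> True False
```

The magic in Ukkonen's paper is getting an equivalent (edge-compressed) structure built in linear time, online, as the text streams in.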
Finally, if you don't have a copy of Cormen, Leiserson, Rivest, and
Stein's Introduction to Algorithms, you should. You may recognize the name
Rivest from RSA and MD5. Chapter 15 covers longest common subsequence. Any
decent graduate-level algorithms course puts that book on the syllabus.
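For reference, the longest-common-subsequence recurrence from that chapter, sketched in Python (keeping only one DP row at a time):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (the DP from CLRS)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur.append(prev[j - 1] + 1)        # extend a common subsequence
            else:
                cur.append(max(prev[j], cur[j - 1]))  # drop a char from a or b
        prev = cur
    return prev[-1]

print(lcs_length("ABCBDAB", "BDCABA"))   # -> 4, the book's classic example
```

Note this is also an n^2 algorithm, but the LCS length (normalized by string length) often makes a better similarity score than raw edit distance.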
--
David Backeberg (dave at math.mit.edu)
On Thu, 18 May 2006, markw at mohawksoft.com wrote:
> While I like to think I'm no slouch in this department, I'd like to see if
> anyone has a better idea:
>
> You have two strings and you want to quantify how similar they are.
>
> Levenshtein Distance is not a good choice for two reasons:
>
> (1) it doesn't really show how similar strings are. For instance:
>
> "Tom Lehrer - That Was The Year That Was" and "That Was The Year That Was -
> Tom Lehrer"
>
> are, to you and me, almost identical, but that's difficult to detect;
> Levenshtein only shows how many steps are required to convert one to the other.
>
> (2) Levenshtein Distance is an N^2 algorithm and, well, yuck.
>
> Anyone have any ideas? I'm currently using an n^2 algorithm that scans
> through both strings and uses run-length matching, but if you know of
> something better, I'd love to hear it.
>
> _______________________________________________
> Discuss mailing list
> Discuss at blu.org
> http://olduvai.blu.org/mailman/listinfo/discuss
>
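For what it's worth, the reordering example in the question above ("Tom Lehrer" moved to the other end) is exactly where token-level measures beat character-level edit distance. A minimal sketch using Jaccard similarity over word sets (one option among many, not something from the papers above):

```python
def jaccard_tokens(s, t):
    """Similarity in [0, 1]: overlap of word sets, ignoring order and case."""
    a, b = set(s.lower().split()), set(t.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

x = "Tom Lehrer - That Was The Year That Was"
y = "That Was The Year That Was - Tom Lehrer"
print(jaccard_tokens(x, y))   # -> 1.0: same words, different order
```

It runs in roughly linear time and is immune to reordering, at the cost of ignoring word order entirely; hybrids that combine token overlap with per-token edit distance are what the SIGMOD fuzzy-match paper explores.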