Tool for identifying languages
John Chambers
jc at trillian.mit.edu
Tue Jan 17 14:10:28 EST 2006
Jeff Kinz wrote:
| On Tue, Jan 17, 2006 at 10:23:17AM -0500, Christopher Schmidt wrote:
| > http://languid.cantbedone.org/
| > http://languid.cantbedone.org/Language-Guess.tgz
...
| Why I'm "wowed":
|
| This tool appears to use some form of statistical analysis based on
| how often certain three "character" strings appear. Also, whitespace is
| one of the characters. Very nice, and thanks again to Chris.
|
| Here's a few random lines of the English "strings" file:
| t t 45
| be 46
| ld 47
| e a 48
| rs 49
| wa 50
| ut 51
| ve 52
| ll 53
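A minimal sketch of how a ranked trigram list like the one quoted above might be built and used to guess a language. It has not been checked against the Language-Guess source; the rank comparison is the "out-of-place" measure from Cavnar and Trenkle's n-gram text categorization, swapped in here as a plausible stand-in. The names trigram_ranks, out_of_place and guess_language, the 300-entry cutoff and the penalty value are all assumptions for illustration.

```python
from collections import Counter

def trigram_ranks(text, top=300):
    """Return {trigram: rank}, with rank 0 for the most frequent trigram."""
    # Collapse runs of whitespace so a space counts as one ordinary character.
    text = " ".join(text.lower().split())
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: rank for rank, (gram, _) in enumerate(ordered[:top])}

def out_of_place(sample, profile, penalty=300):
    """Sum of rank differences; smaller means the sample resembles the profile more."""
    # A trigram missing from the profile is charged the (assumed) penalty rank.
    return sum(abs(rank - profile.get(gram, penalty))
               for gram, rank in sample.items())

def guess_language(text, profiles):
    """Pick the profile name with the smallest out-of-place distance."""
    sample = trigram_ranks(text)
    return min(profiles, key=lambda name: out_of_place(sample, profiles[name]))
```

Here profiles would be a dict such as {"en": trigram_ranks(english_text), "de": trigram_ranks(german_text)}, built once from a reasonably large sample of each language.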
This works better than most people would believe. Some years back, I
had a bit of fun with it at a place that I worked. I wrote a little
program to collect these trigraph statistics and fed it a stack of
company email memos. Then I wrote another program that generated
pseudo-random text with the same statistics. This output got piped to
another program that added random punctuation and capitalization with
stats from the same source. Another program added email headers and
sent the results out to a mailing list.
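The first two programs described above might look roughly like this. It is a sketch under stated assumptions, not the original code (which is not shown here): collect_stats and generate are hypothetical names, the statistics are kept as "which character follows each (n-1)-character prefix, and how often", and the punctuation, capitalization and mail-header stages are left out.

```python
import random
from collections import Counter, defaultdict

def collect_stats(text, n=3):
    """Map each (n-1)-character prefix to a Counter of the characters that follow it."""
    if n < 2:
        raise ValueError("n must be at least 2")
    # Collapse whitespace so a single space is just another character.
    text = " ".join(text.split())
    stats = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        stats[gram[:-1]][gram[-1]] += 1
    return stats

def generate(stats, length=200, rng=random):
    """Emit pseudo-random text that follows the source's n-gram frequencies."""
    prefixes = list(stats)
    out = rng.choice(prefixes)
    k = len(out)  # prefix length, i.e. n - 1
    while len(out) < length:
        followers = stats.get(out[-k:])
        if not followers:
            # Dead end (this prefix was never followed by anything in the
            # source): restart from a random prefix.
            out += rng.choice(prefixes)
            continue
        chars, weights = zip(*followers.items())
        # Pick the next character in proportion to how often it followed
        # this prefix in the source.
        out += rng.choices(chars, weights=weights)[0]
    return out[:length]
```

Calling generate(collect_stats(memo_text, n=3)) gives the trigraph version, and n=4 gives the 4-char variant; memo_text stands for whatever source text is fed in.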
The recipients really loved the results. I heard people reading them
to each other, and breaking out laughing. Several ended up on
bulletin boards in the hallways.

I also tried it with 4-char sequences, and it was interesting that
the results weren't much funnier. More of the words were real English
words. But even with the 3-char case, almost all the words that came
out were pronounceable and looked like they could be English words.

I also generated a few man pages with the same programs, using the
online unix manuals for the statistics. With those, the 4-char
statistics worked better, because they picked up a lot of the unix
tech terms and phrases, and mixed them in pseudo-randomly among the
pseudo-English words.

The joke is only funny for a short time, though, and quickly becomes
rather repetitive. Part of the reason that Jabberwocky has been such
a success is that it's short. An epic poem in the same style would
put you to sleep after a while.