BLU Discuss list archive

Subject: Tool for identifying languages
From: jc at trillian.mit.edu (John Chambers)
Date: Tue, 17 Jan 2006 19:10:28 UTC
In-reply-to: 20060117125754.B17487@redline.comcast.net
References: 20060117125754.B17487@redline.comcast.net, <20060117093406.A17487@redline.comcast.net> <20060117152317.GC28663@crschmidt.net>

Jeff Kinz wrote:
| On Tue, Jan 17, 2006 at 10:23:17AM -0500, Christopher Schmidt wrote:
| > http://languid.cantbedone.org/
| > http://languid.cantbedone.org/Language-Guess.tgz
...
| Why I'm "wowed":
|
| This tool appears to use some form of statistical analysis based on
| how often certain three "character" strings appear.  Also, whitespace is
| one of the characters.   Very nice, and thanks again to Chris.
|
| Here's a few random lines of the English "strings" file:
| t t                     45
|  be                     46
| ld                      47
| e a                     48
| rs                      49
|  wa                     50
| ut                      51
| ve                      52
| ll                      53
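
For anyone who wants to see what such a table looks like on their own
text, here is a minimal Python sketch that counts three-character
strings (whitespace included) and ranks them by frequency. It is not
taken from the Language-Guess tarball; the function name trigram_ranks
is hypothetical, and reading the numbers in the excerpt above as ranks
is an assumption.

```python
from collections import Counter

def trigram_ranks(text, top=10):
    """Return {trigram: rank} for the `top` most frequent 3-char strings."""
    # Collapse runs of whitespace so that a space counts as one character.
    text = " ".join(text.lower().split())
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    # Rank 1 is the most frequent trigram.
    ranked = [gram for gram, _ in counts.most_common(top)]
    return {gram: rank for rank, gram in enumerate(ranked, start=1)}

if __name__ == "__main__":
    sample = "It was the best of times, it was the worst of times"
    for gram, rank in trigram_ranks(sample).items():
        print(f"{gram!r:10} {rank}")
```

Comparing the ranked list from an unknown text against stored lists for
each language is one plausible way to do the guessing, but that step is
not shown here.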


This works better than most people would believe.  Some years back, I
had  a  bit of fun with it at a place that I worked.  I wrote a little
program to collect these trigraph statistics and fed it  a  stack  of
company  email  memos.   Then  I wrote another program that generated
pseudo-random text with the same statistics. This output got piped to
another program that added random punctuation and capitalization with
stats from the same source.  Another program added email headers  and
sent the results out to a mailing list.
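
The first two programs in that pipeline can be sketched in a few lines
of Python. This is a reconstruction under assumptions, not the original
code, which is not shown in this message: the names build_model and
generate are hypothetical, and the sketch picks each next character in
proportion to how often it followed the previous n-1 characters in the
source text. The punctuation, capitalization and mailing steps are left
out.

```python
import random
from collections import Counter, defaultdict

def build_model(text, n=3):
    """Map each (n-1)-char prefix to a Counter of the chars that follow it.

    Assumes n >= 2 and len(text) >= n.
    """
    model = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        model[gram[:-1]][gram[-1]] += 1
    return model

def generate(model, length, seed=None):
    """Emit `length` characters with the same n-gram statistics as the model."""
    rng = random.Random(seed)
    prefixes = sorted(model)          # sorted so a given seed is repeatable
    start = rng.choice(prefixes)
    k = len(start)
    out = list(start)
    while len(out) < length:
        followers = model.get("".join(out[-k:]))
        if not followers:
            # Dead end: this prefix only occurred at the very end of the
            # source text, so restart from a random known prefix.
            out.extend(rng.choice(prefixes))
            continue
        chars, weights = zip(*followers.items())
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out[:length])

if __name__ == "__main__":
    corpus = ("please note that the quarterly meeting has been moved "
              "to the large conference room on the third floor ")
    print(generate(build_model(corpus, n=3), 200))
```

Passing n=4 to build_model gives the 4-char variant; with a corpus this
small the output mostly repeats the input, so a larger body of text is
needed before the results start to look like new words.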

The recipients really loved the results.  I heard people reading them
to  each  other,  and  breaking  out  laughing.   Several ended up on
bulletin boards in the hallways.

I also tried it with 4-char sequences, and it  was  interesting  that
the results weren't much funnier. More of the words were real English
words.  But even with the 3-char case, almost all the words that came
out were pronounceable and looked like they could be English words.

I also generated a few man pages with the same  programs,  using  the
online  unix  manuals  for  the  statistics.   With those, the 4-char
statistics worked better, because they picked up a lot  of  the  unix
tech  terms  and phrases, and mixed them in pseudo-randomly among the
pseudo-English words.

The joke is only funny for a short time, though, and quickly  becomes
rather repetitive.  Part of the reason that Jabberwocky has been such
a success is that it's short.  An epic poem in the same  style  would
put you to sleep after a while.

Boston Linux & Unix / webmaster@blu.org