Boston Linux & UNIX was originally founded in 1994 as part of The Boston Computer Society. We meet on the third Wednesday of each month at the Massachusetts Institute of Technology, in Building E51.

BLU Discuss list archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Discuss] vnc

> From: at [mailto:discuss-
> at] On Behalf Of Dan Ritter
> Shannon's experimental tests are reported and repeatable, and
> have in fact been repeated several times.

I'm on vacation, and this is fun for me, so I followed google on this, and found a convenient java app that you can interact with (  They select a sentence from their database, and you can sit there and guess characters until you have completed the puzzle.  Most people will apply a little intelligence, prioritizing RSTLN E, but I went a step further than that - I wrote a trivial helper app to help me - it takes a word list and it takes the known characters so far already known for a given word, and displays the frequency count of each possible next character, assisting me as an English language character frequency predictor.  I take this under advisement, and I also apply my brain and knowledge of the structure of English.  I make my best guess to each next character.

My puzzle was 139 characters long, I completed it in 380 total guesses and 241 incorrect guesses.  The measured Shannon entropy was 1.95 bits per character.

According to this estimate, 1.95 bits per character * 139 characters = 271.05 bits of entropy in the sentence.  I found this to be unbelievably high (my reasons for disbelief are complex enough I'd have an extremely tough time writing them down in an email), so I did this test:

Suppose there is a sentence, "your mother wears combat boots."   Suppose an adversary knows this sentence.  Then I think we'll all agree, the sentence has 0 bits of entropy, because entropy is a measure of how many guesses an adversary would need to make in order to correctly guess the unknown - and there is no unknown.  Now suppose the adversary only knows "your __ther wears combat boots" and suppose the adversary knows the correct answer is either "fa" or "mo".  Then I think we'll all agree the sentence has 1 bit of entropy, which would be 0.0333 bits per character (1/30).

So I went back to that app and re-typed the sentence with no errors.  This should result in 0 bits entropy, and indeed it does.

Then I re-typed it with 1 error.  This should result in 0.007 bits per character (1 bit in 139 characters), but actually results in 0.06.  Which confirms my suspicion that it greatly over estimates the amount of entropy per character.

I don't know *how* or *why* their equation is inaccurate.  I only know it is inaccurate.

For a little more fun, let's assume I'm somehow wrong in the above.  (But I'm not.)  Let's come up with another technique of estimating:

In the GSL, there are 2284 words, average length 5.84 characters, so if you randomly select a word, it's about 11.15 bits of entropy per word, and an average of 1.91 bits of entropy per character.  This establishes a theoretical upper bound for number of bits entropy per character.  Pure random words with absolutely no language structure yield 1.91 bits of entropy per character.

But if you add on every valid prefix, suffix, tense, conjugation, etc, then the total is 109,582 valid word permutations with average length 9.53 characters, so if you randomly select from this larger list, you have around 16.74 bits of entropy in the word, and 1.75 bits of entropy per character.  This is lower than the entropy per character in the GSL, because when we added all the permutations of words to the dictionary, the number of bits of entropy per word grew slowly relative to the length of the word.

This sets the theoretical upper bound for number of entropy bits per character - around 1.75 or 1.91.  Pure random.  Best case.

Since the result of me intelligently playing the Shannon guessing game described above resulted in 1.95 which is higher than the theoretical maximum for pure randomness, and in fact I was able to predict many of the words and characters which means the actual entropy per character should be much lower than the theoretical max, this shows solidly that the Shannon estimator grossly overestimates the number of entropy bits per character in a structured sentence.

BLU is a member of BostonUserGroups
BLU is a member of BostonUserGroups
We also thank MIT for the use of their facilities.

Valid HTML 4.01! Valid CSS!

Boston Linux & Unix /