Note: He is having problems with his email, and asked me to forward this to the list.
---------------------------------------------------------
I have decided to provide this information to the hobbyists at the BLU group. I have gotten mail from a few of you, and I realize that most of you are not representative of the people who attacked me.

In the Bible, God tells a story about a farmer who plants a field. At night, while the farmer and his field hands are asleep, the farmer's enemy comes and sows weed seed, clearly intending to make it impossible for the farmer to raise his crop. In the morning the misdeed is discovered, and the field hands ask the farmer whether they should burn everything in the field, destroying the good seed the farmer planted along with the foul seed planted by his enemy. But though the field hands were ready to go forward, the farmer (who must have been very wise) tells them not to do this, and proclaims that once his crop is grown, it will be easy to tell the weeds from the crops he planted. The story ends with the farmer declaring that at harvest time the crop would be harvested and the weeds would be burned up... To me, it's clear that we live in that world.

Now, about compression. I am the very same Jules Gilbert who declared to all the world (courtesy of comp.compression) that I had developed a repeatable compressor. Unfortunately, without so much as a "hello", I was jumped on and declared to be a liar. But I am not lying, and the method I had developed (which was the basis for my claim) is generally described below. I say 'generally' because I've had some time to play with it, and what I describe below is slightly improved. If you can program in C or Fortran you can easily code it up in a day or so. So get out your numerical methods texts and be off!

Let's say that you have a file produced from the process of compression, and that you segment this file into 256k-byte blocks. In the following I am going to describe how to compress (supposedly incompressible) data serially, one buffer at a time. This is nice, but the really big gains (using this method, which is not optimal at all; I have much better designs in terms of time and space performance) are derived from processing ALL cell0's concurrently, then cell1's, and so on. However, let's learn to crawl before we learn to walk.

First, let's say our example buffer begins with the values 1,29,3,7. Allocate a second buffer of 256k cells using unsigned longs, and serially add the values from the first buffer, so that the second buffer contains 1,30,33,40,... Notice that it's monotonic. Now we can curve-fit the values. Use a high-degree polynomial curve fitter, say 30 terms. (The curve fitter I like gives me about 31 terms, maximum. It's not the NRC one, which delivers almost any desired degree of precision.) My suggestion is to play. Get it coded and run it, using different model parameters. You'll discover some very interesting things about the space-gain performance measured just after the curve fit. Clearly the optimal way to do this is to do the fit ONCE on the entire file, then segment into smaller sizes.

Continuing: do the curve fit, 30 terms. From the resulting coefficients, produce an approximation of the original input, and subtract the original monotonic data values from the approximated values. Do this subtraction in place. Notice that the curve-fitting operation requires 30 terms of eight bytes each. (That's 240 bytes, which is pure overhead.)
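For concreteness, here is a rough C sketch of that prefix-sum and curve-fit step as described above. It is only an illustration: the fit degree (kept at 8 rather than 30 to keep the naive normal-equations fitter numerically sane), the [0,1] scaling of the index, and the plain Gaussian elimination are my own assumptions, not the fitter the post actually used.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NCELLS (256 * 1024)   /* one 256k-byte block */
#define DEGREE 8              /* the post suggests ~30 terms; kept small here (assumption) */

/* Least-squares fit y[i] ~ sum_k c[k] * x^k via normal equations,
 * with x = i/(n-1) scaled into [0,1]. */
static void polyfit(const double *y, size_t n, double c[DEGREE])
{
    double A[DEGREE][DEGREE] = {{0}};
    double b[DEGREE] = {0};

    for (size_t i = 0; i < n; i++) {
        double x = (double)i / (double)(n - 1);
        double p[DEGREE], pk = 1.0;
        for (int k = 0; k < DEGREE; k++) { p[k] = pk; pk *= x; }
        for (int r = 0; r < DEGREE; r++) {
            b[r] += p[r] * y[i];
            for (int s = 0; s < DEGREE; s++)
                A[r][s] += p[r] * p[s];
        }
    }
    /* The moment matrix is symmetric positive definite, so plain
     * Gaussian elimination without pivoting is adequate for a sketch. */
    for (int k = 0; k < DEGREE; k++) {
        for (int r = k + 1; r < DEGREE; r++) {
            double f = A[r][k] / A[k][k];
            for (int s = k; s < DEGREE; s++) A[r][s] -= f * A[k][s];
            b[r] -= f * b[k];
        }
    }
    for (int k = DEGREE - 1; k >= 0; k--) {
        for (int s = k + 1; s < DEGREE; s++) b[k] -= A[k][s] * c[s];
        c[k] = b[k] / A[k][k];
    }
}

int main(void)
{
    static unsigned char in[NCELLS];
    static double mono[NCELLS];   /* running (monotonic) sums, then residuals */
    double c[DEGREE] = {0};

    size_t n = fread(in, 1, NCELLS, stdin);
    if (n < DEGREE) return 1;

    /* Serial prefix sum: 1,29,3,7 -> 1,30,33,40,... */
    double run = 0.0;
    for (size_t i = 0; i < n; i++) { run += in[i]; mono[i] = run; }

    polyfit(mono, n, c);

    /* Evaluate the fit and replace each value with (approximation - original),
     * i.e. the subtraction done in place, as described above. */
    for (size_t i = 0; i < n; i++) {
        double x = (double)i / (double)(n - 1), approx = 0.0, p = 1.0;
        for (int k = 0; k < DEGREE; k++) { approx += c[k] * p; p *= x; }
        mono[i] = approx - mono[i];
    }

    /* The DEGREE coefficients (8 bytes each) are the per-block overhead;
     * with 30 terms that would be the 240 bytes mentioned above. */
    fwrite(c, sizeof c[0], DEGREE, stdout);
    return 0;
}

A real attempt at 30 terms would need an orthogonal-polynomial or QR-based fitter; the naive normal equations shown here lose precision quickly as the degree grows.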
Now look at the resulting unsigned values. They are no longer random in their left-most bit positions. In fact, if you extract the left-most three or four significant bits, you have yourself a very compressible signal. Just remember that what does not compress, you have to send. And sending costs -- that's the measure by which all compressors are judged: how much do you have to send.

Notice that the left-most three significant bits don't begin in the same position in each of the 256k cells. However, mapping the left-most bit position and sending this is very low-cost, roughly 500 bytes for a 256k buffer that's processed this way. (This is from memory, but I used to build these things; that's what I did.)

The problem now is that after removing the significant left-most three bits of non-random data from each column cell, we are left with data that, as you move from bit-column to bit-column from left to right, exhibits increasing randomness. The columns on the right remain completely random. We have a lot of options doing multiple 256k buffers concurrently, but for now we do this: take the right-most columns and simply repackage the data, then reapply the process. The remaining left columns, while not of sufficient quality to send to a conventional good-quality compressor, can be compressed. Here is one way that works: make an RLE list from each bit-column, and compress those individually. They are not random; they will compress. Please note: if you try to compress these values before doing the RLEs, you will not obtain any compression. Do the RLEs.

Now, let's talk about cost and compute the savings (this is per pass): 240 bytes for the coefficients, 500 bytes for tracking the three-bit compressible column fragment. We save three bits times 256k cells. That's almost 100k, but we have to compress it, and the savings is actually only about 40k bytes. But we are not done -- we still have to account for those bit-columns we compressed by first making them into an RLE list. I found that even slight changes in the type of curve fitter produced widely varying numbers of columns. But this is the gestalt of my learning: expect about 5k, maybe 7k, of compression per logical pass. Not a lot, I know. A friend gave me a special fitter (from the NR book, I think...) which produced almost any desired degree of precision. It took forever, but it worked. That just about doubled the savings, say to 10% or very close.

Once I had this running I knew that what people had inferred from Shannon wasn't true. And while this method doesn't have any merit (because of the huge CPU overhead necessary to do the curve fitting), other systems, not based on floating-point mechanics or involving integer DSP logic, work well. Unfortunately, these are all bit-based and, running on an X86 box, do require lots of machine time. Of course, this is a special case, because this process involves a step that requires lots of CPU time! As I keep saying, it's not practical. But does it work? Absolutely.

What I've just described is pretty easy to build. Anyone with a copy of the NRC or NRF should have no trouble constructing something within a day or so. My first one was 64k, not in C, and it barely worked. It took almost a full day because the atomic instructions to read and write the buffer values were within CLIPS, an expert system shell. When I went to C I discovered that 8k was probably about as thin as I could make things and still get compression. Try 1-bit-wide buffers with 8k cells. Thin! Oh yes, that might even be practical, but I discovered several bit-based architectures that deliver the goods without curve fitting.
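Here is a rough C sketch of the bit-extraction and per-column RLE step described above. Again, this is only an illustration under my own assumptions: the residual magnitudes are taken as 32-bit values on stdin, the sample bit-column index is arbitrary, and helpers like msb_pos and rle_column are names I invented, not code from the post.

#include <stdio.h>
#include <stdint.h>

#define NCELLS (256 * 1024)

/* Position of the most significant set bit (0 for value 0). */
static int msb_pos(uint32_t v)
{
    int p = 0;
    while (v >>= 1) p++;
    return p;
}

/* Run-length encode one bit-column across the cells as (bit, run-length)
 * pairs.  Returns the number of runs produced. */
static size_t rle_column(const uint32_t *cells, size_t n, int col,
                         unsigned char *bits, uint32_t *runs)
{
    size_t nruns = 0;
    int cur = (cells[0] >> col) & 1;
    uint32_t len = 1;
    for (size_t i = 1; i < n; i++) {
        int b = (cells[i] >> col) & 1;
        if (b == cur) { len++; continue; }
        bits[nruns] = (unsigned char)cur; runs[nruns++] = len;
        cur = b; len = 1;
    }
    bits[nruns] = (unsigned char)cur; runs[nruns++] = len;
    return nruns;
}

int main(void)
{
    static uint32_t cells[NCELLS];       /* residual magnitudes from the fit step */
    static unsigned char top3[NCELLS];   /* left-most three significant bits */
    static unsigned char msb[NCELLS];    /* where those bits start in each cell */
    static unsigned char bits[NCELLS];
    static uint32_t runs[NCELLS];

    size_t n = fread(cells, sizeof cells[0], NCELLS, stdin);
    if (n == 0) return 1;

    for (size_t i = 0; i < n; i++) {
        int p = msb_pos(cells[i]);
        msb[i] = (unsigned char)p;   /* the per-cell position map the post prices at ~500 bytes once packed */
        top3[i] = (unsigned char)((cells[i] >> (p > 2 ? p - 2 : 0)) & 7);
    }

    /* RLE one of the remaining bit-columns as an example (column 4, chosen arbitrarily). */
    size_t nruns = rle_column(cells, n, 4, bits, runs);
    fprintf(stderr, "column 4: %zu runs over %zu cells\n", nruns, n);

    /* top3[], msb[] and the per-column RLE lists would then go to an
     * ordinary entropy coder. */
    return 0;
}

Whether any of this nets out to a saving is exactly the question the post's cost accounting is trying to answer; the sketch just shows where the claimed three-bit signal and the RLE lists would come from.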
--
I'm good at 'hello', not great at 'how are you', and I suck at 'goodbye.'
- Joss Whedon