endless compression

markw at mohawksoft.com
Fri Aug 13 09:31:01 EDT 2004


markw at mohawksoft.com wrote:
>
>> When you look at a ZIP compressed file, it does not compress well using
>> standard stream compressors. This is because there are few if any
>> repeating patterns. The fact that it has few repeating patterns makes
>> this stream interesting. It means that it is less likely in a universe
>> of streams. If it is less likely, then it can be represented with fewer
>> bits.

>I don't see why it matters how "likely" the stream is.  Even if the
>stream I want to compress is extremely unlikely, there might be some
>very large number of other streams that are equally unlikely.

That's not the point. If you have a repeatable compressor, the compression
ratio is largely unimportant: if you can compress something by 3% over and
over again, you can make it really small.

The practical aspects of this, i.e. the cost of representing each
iteration, may make it impractical, but that is an implementation problem,
not a theoretical one.
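A back-of-the-envelope version of that concern, with purely assumed
numbers (3% savings per round and a 4-byte header per iteration, both
hypothetical):

    # Hypothetical numbers: each round saves 3% but costs a 4-byte
    # header recording how to undo it.
    def net_size(n: float, header: int = 4) -> float:
        return 0.97 * n + header

    n = 1000.0
    for _ in range(10):
        n = net_size(n)
        print(round(n))
    # Shrinks only while 0.03 * n > header, i.e. until n falls to
    # roughly 133 bytes; after that each "iteration" grows the stream.

So even granting the repeatable 3%, the per-iteration bookkeeping sets a
floor on how small the stream can get.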


>If your recompression engine can say "aha, even though this stream does
>not have any 'repeating patterns' that a ZIP compressor can recognize,
>the data in the stream could be reconstituted with the following
>formula...", then the engine could write out some abbreviation that
>formula, and the corresponding uncompression engine could read the
>abbreviation and regenerate the stream.
>
>Of course, you might as well apply that pattern-recognition technology
>to the original document, rather than the ZIPped version, right?

It works on the zipped version because we know something about ZIP files:
they have very few repeating sequences.
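This can be checked directly: the byte-level Shannon entropy of compressed
output sits near the 8 bits/byte of random noise, while ordinary data sits
well below it (a small sketch; byte_entropy is an illustrative helper, not
a standard library function):

    import math, zlib
    from collections import Counter

    def byte_entropy(data: bytes) -> float:
        """Shannon entropy in bits per byte; 8.0 looks like random noise."""
        n = len(data)
        return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

    text = b" ".join(str(i).encode() for i in range(20000))
    print(byte_entropy(text))                 # well below 8: 11 distinct bytes
    print(byte_entropy(zlib.compress(text)))  # near 8: few patterns left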


>And once you reach the point where the formula to reconstitute the
>stream takes up just as many bits as the original stream, then you lose.

That's exactly what I said at the end of the post to which you replied.
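The counting argument behind that point is easy to verify: there are 2^n
distinct n-bit streams but only 2^n - 1 streams strictly shorter than n
bits, so no lossless scheme, iterated or not, can shrink every input (a
toy check, not from the original post):

    # Pigeonhole check: n-bit inputs vs. all strictly shorter outputs.
    n = 16
    inputs  = 2 ** n                          # 65536 distinct n-bit streams
    outputs = sum(2 ** k for k in range(n))   # 2^0 + ... + 2^(n-1) = 65535
    print(inputs, outputs)                    # at least one input cannot shrink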

Like I said, I'm not saying it *is* possible, I'm just saying it *may be*
possible. It is an interesting theory. If it is possible, it could mean
HUGE things for long-term archival of data.


