On Thu, Aug 12, 2004 at 03:54:31PM -0400, markw at mohawksoft.com wrote:

> (Let me preface this with some background: I've written a few
> format-specific compressors, and have implemented a few standard
> compressors.)

I shudder to think about them.

> Let's talk about *what* compression is. Compression is representing a
> theoretically larger range of data with a smaller range. Compression
> is removing "knowledge" about the data from the data, leaving only
> the "interesting" part that is needed to reconstitute the data based
> on knowledge.

Bzzt, wrong. You've just described "lossy compression". Lossy
compression throws out information. The file which is reconstructed is
NOT the same as the original file; it is an approximation which in
some domains (notably audio and video) may be considered "good enough".

Lossless compression does not destroy information. Lossless
compression is finding a more compact notation to express exactly the
same information. It removes redundancies and replaces them with an
explanation of how to reconstruct them exactly.

> The idea, then, is to take knowledge of a numerical series or
> knowledge about data. The theoretical number range may be X, but in
> reality there is only Y variation, and if Y < X, the data should be
> compressible.

You have now taken an example of lossless and an example of lossy and
announced that lossy is always acceptable. It isn't.

> When you look at a ZIP-compressed file, it does not compress well
> using standard stream compressors. This is because there are few if
> any repeating patterns. The fact that it has few repeating patterns
> makes this stream interesting. It means that it is less likely in a
> universe of streams. If it is less likely, then it can be represented
> with fewer bits.

OK, you've defined "information" the way Shannon did. Big deal.

> Standard compressors work on the notion that repeated patterns can
> be saved once and re-used when they show up again. LZW actually uses
> more bits per byte for each byte, but many fewer bits for strings of
> repeated bytes.

Erm... kind of. LZW uses a dictionary-and-quoting method. Repeated
byte patterns are stored in the dictionary, taking up slightly more
space than a single copy would, and then a symbol for the pattern -- a
reference to the dictionary -- is inserted. But LZW is smart enough to
simply store quoted information -- non-referenceable strings -- as
themselves. Most of the improvements in this system have come from
adjustments in dictionary hashing, look-ahead window size and
sub-pattern matching.

> A "re-compressor" is different: rather than "repeated" bytes, it
> would use information about special sequences of data that have
> little or no repetition and few recognizable patterns.

Where does this information come from?

> This may be hugely expensive to compute and offer only a
> compressibility of a few percent; however, if it is repeatable,

Then it didn't work. (A compressor that shrank every input, even by a
single bit, could be iterated until nothing was left, and the
pigeonhole principle rules that out.)

> then it can iterate and iterate until the overhead for the
> compression is larger than the data.

> I'm not saying it is practical, workable, or really possible, I'm
> just saying that, in theory, it may be possible.

Can I accurately rephrase that as "I'm not saying that it is possible,
I'm saying that it may be possible."?

OK, you and Jules are cranks. Go read the comp.compression FAQ a few
more times, and leave us alone.

-dsr-
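To make the dictionary idea above concrete, here is a rough
textbook-LZW sketch in Python. It is an illustration only, not what any
particular tool ships: real implementations vary the code width, reset
the dictionary when it fills, and tune the hashing, and the function
names here exist only for this sketch.

def lzw_compress(data: bytes) -> list[int]:
    # Dictionary starts with every single byte: codes 0..255.
    dictionary = {bytes([i]): i for i in range(256)}
    next_code = 256
    codes = []
    current = b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            # Keep extending the match while the dictionary knows it.
            current = candidate
        else:
            # Emit the code for the longest known prefix, then learn
            # the new, one-byte-longer pattern for later reuse.
            codes.append(dictionary[current])
            dictionary[candidate] = next_code
            next_code += 1
            current = bytes([byte])
    if current:
        codes.append(dictionary[current])
    return codes

def lzw_decompress(codes: list[int]) -> bytes:
    if not codes:
        return b""
    dictionary = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:
            # The one odd case: the code refers to the entry we are
            # about to create (previous pattern plus its first byte).
            entry = previous + previous[:1]
        output.append(entry)
        dictionary[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return b"".join(output)

sample = b"TOBEORNOTTOBEORTOBEORNOT"
out = lzw_compress(sample)
assert lzw_decompress(out) == sample
print(len(sample), "bytes ->", len(out), "codes")

Run it on a repetitive string and you get fewer codes than input bytes;
run it on random or already-compressed bytes and you get roughly one
code per byte, each wider than 8 bits, so the output grows.

And a tiny counting check of why a scheme that shrinks every input --
and so could be iterated forever -- cannot exist:

# There are 2**n distinct n-bit inputs but only 2**n - 1 distinct
# outputs shorter than n bits, so two inputs would have to share an
# output and decompression could not tell them apart.
n = 16
inputs = 2 ** n
shorter_outputs = sum(2 ** k for k in range(n))   # lengths 0 .. n-1
print(inputs, ">", shorter_outputs)               # 65536 > 65535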