> Edward Ned Harvey wrote:
> > . Never use --sparse when creating an archive that is
> > compressed. It's pointless, and doubles the time to create the archive.
> >
> > . Yes, use --sparse during extraction, if the contents contain a
> > lot of serial 0's and you want the files restored to a sparse state.
> >
> > The man page saying "using '--sparse' is not needed on extraction" is
> > misleading. It's technically true - you don't need it - but it's
> > misleading - yes, you need it if you want the files to be extracted
> > sparsely.
>
> Have you confirmed that through code inspection or experimentation?

I'll test it right now... I have a 400 MB sparse file, junk.tc, occupying
1.06 MB on disk.

$ time tar cf - junk.tc | gzip --fast > junk.tc.tar.gz
real 0m3.688s
(junk.tc.tar.gz is 2.79 MB)

$ time tar cf - --sparse junk.tc | gzip --fast > junk.tc.sparse.tar.gz
real 0m33.130s
(junk.tc.sparse.tar.gz is 1.04 MB)

If I extract the non-sparse tar.gz file as non-sparse ... I get a
non-sparse result. As expected.

If I extract the non-sparse tar.gz file as sparse ... I get a non-sparse
result. Bah.

If I extract the sparse tar.gz without using the --sparse switch ... I get
a sparse file.

Apparently I was wrong. Apparently you have no choice about it. If you
want to back up a sparse file with tar, you have to spend the extra time
and use the --sparse option during archive creation.

> Also consider that the code to detect strings of zeros seems to be on
> the read side (based on the man page description). On extraction, it
> wouldn't make sense to expand the unused portions to strings of zeros,
> then follow that with code that detects the zeros and seeks past them to
> write a sparse file.

My expectation was: on extraction, detect strings of 0's and make them
holes.

> > ...you may be overestimating the time to read or md5sum all the 0's
> > in the hole of sparse files.
>
> Perhaps, but...
>
> > The hypothetical sparse_cat would improve performance, but just
> > marginally.
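In case anyone wants to reproduce the experiment above, here's a rough sketch, assuming GNU tar and GNU coreutils (truncate, du, stat); junk.img is just a stand-in name:

```shell
# Make a 100 MB file that is entirely a hole: logical size 100 MB,
# (almost) zero blocks actually allocated on disk.
truncate -s 100M junk.img
du -h --apparent-size junk.img   # logical size: 100M
du -h junk.img                   # allocated:    ~0

# --sparse at *creation* time is what records the holes in the archive.
tar cf junk.tar --sparse junk.img

# On extraction the holes come back with no extra options needed.
mkdir -p out
tar xf junk.tar -C out
du -h out/junk.img               # still ~0 allocated
stat -c %b out/junk.img          # 512-byte blocks allocated: ~0
```

Without --sparse at creation, tar stores the full 100 MB of zeros in the archive and the extracted copy allocates all of it, which matches the results above.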
> ...it would eliminate the need for a two-pass read with tar. And if
> summing zeros is fast, why is rsync so slow in your experiments?

Well, I've demonstrated you can sum the 0's very quickly, but I don't yet
know why rsync stinks at this.

> (A literal sparse_cat (drop-in replacement for cat) wouldn't actually be
> that useful, as you need to communicate to the process receiving the
> stream the byte offset for each chunk of data, assuming you want to be
> able to reconstruct the sparse file later with the same holes. So
> practically speaking, this is something you'd have to integrate into
> tar, gzip, rsync, or whatever archiver you're using.
>
> It sounds like it would be a small project to patch tar to use the
> fcntl, as it already has a data structure figured out for recording the
> holes. But you'd still need additional hacks to do incremental
> transfers. So the bigger win would be patching rsync.)

I do plan on writing an experimental Python script. Not that it'll
actually be useful, but at least it will prove the concept. And then maybe
the rsync guys will care. Don't know.
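For what it's worth, here's roughly what such a script could look like - a sketch, not the real thing, assuming Linux and Python 3.3+, where os.lseek exposes the SEEK_DATA/SEEK_HOLE interface for finding holes without reading them. sparse_map and sparse_copy are names I made up:

```python
import os

def sparse_map(path):
    """Return the (offset, length) data extents of a file, skipping holes.
    Uses SEEK_DATA/SEEK_HOLE; on filesystems without hole-tracking support
    the kernel falls back to treating the whole file as one data extent."""
    extents = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                data = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError:
                break  # ENXIO: nothing but a hole from offset to EOF
            hole = os.lseek(fd, data, os.SEEK_HOLE)
            extents.append((data, hole - data))
            offset = hole
    finally:
        os.close(fd)
    return extents

def sparse_copy(src, dst):
    """Copy src to dst, reading and writing only the data extents.
    Seeking past holes in dst leaves them unallocated, and the final
    truncate() sets the full logical size without allocating blocks."""
    with open(src, 'rb') as fin, open(dst, 'wb') as fout:
        for offset, length in sparse_map(src):
            fin.seek(offset)
            fout.seek(offset)
            remaining = length
            while remaining > 0:
                chunk = fin.read(min(remaining, 1 << 20))
                fout.write(chunk)
                remaining -= len(chunk)
        fout.truncate(os.path.getsize(src))
```

As the parenthetical above points out, a drop-in cat can't do this over a pipe; this only works because both ends share the (offset, length) list, which is exactly the bookkeeping an archiver would have to carry in its stream format.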