Backing up sparse files ... VM's and TrueCrypt ... etc
Edward Ned Harvey
blu-Z8efaSeK1ezqlBn2x/YWAg at public.gmane.org
Sun Feb 21 20:43:07 EST 2010
> Edward Ned Harvey wrote:
> > . Never use --sparse when creating an archive that is
> > compressed. It's pointless, and doubles the time to create the archive.
> >
> > . Yes, use --sparse during extraction, if the contents contain a
> > lot of serial 0's and you want the files restored to a sparse state.
> >
> > The man page saying "using '--sparse' is not needed on extraction" is
> > misleading. It's technically true - you don't need it - but it's
> > misleading - yes, you need it if you want the files to be extracted
> > sparsely.
>
> Have you confirmed that through code inspection or experimentation?
I'll test it right now...
I have a 400 MB sparse file, junk.tc, occupying 1.06 MB on disk.
$ time tar cf - junk.tc | gzip --fast > junk.tc.tar.gz
real 0m3.688s
(junk.tc.tar.gz is 2.79 MB)
$ time tar cf - --sparse junk.tc | gzip --fast > junk.tc.sparse.tar.gz
real 0m33.130s
(junk.tc.sparse.tar.gz is 1.04 MB)
If I extract the non-sparse tar.gz file as non-sparse ... I get a
non-sparse result. As expected.
If I extract the non-sparse tar.gz file as sparse ... I get a non-sparse
result. Bah.
If I extract the sparse tar.gz, without using the --sparse switch ... I
get a sparse file.
Apparently I was wrong. Apparently you have no choice about it. If you
want to back up a sparse file with tar, you have to waste a bunch of time,
and you have to use the --sparse option during archive creation.
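Whether an extracted file actually came out sparse can be checked by comparing its apparent size with the blocks actually allocated. A minimal Python sketch of that check (the helper name is made up, and st_blocks is POSIX-specific):

```python
import os
import tempfile

def is_sparse(path):
    """True if fewer bytes are allocated on disk than the apparent size."""
    st = os.stat(path)
    # st_blocks counts 512-byte units on POSIX systems
    return st.st_blocks * 512 < st.st_size

# Make a 400 MB file that occupies almost nothing on disk:
path = os.path.join(tempfile.mkdtemp(), "junk.tc")
with open(path, "wb") as f:
    f.seek(400 * 1024 * 1024 - 1)  # seeking past EOF leaves a hole
    f.write(b"\0")

print(os.path.getsize(path), is_sparse(path))
```

The same check works on tar's output, which is how the extraction results above can be verified.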
> Also consider that the code to detect strings of zeros seems to be on
> the read side (based on the man page description). On extraction, it
> wouldn't make sense to expand the unused portions to strings of zeros,
> then follow that by code that detects the zeros and seeks past them to
> write a sparse file.
My expectation was: on extraction, detect strings of 0's, and make them
holes.
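That expectation fits in a few lines of Python: read the stream in fixed-size blocks, and for an all-zero block seek forward instead of writing, which leaves a hole in the output file. This is only a sketch of the extraction-side behaviour I expected, not what GNU tar actually does; the function name and block size are illustrative:

```python
import os

def write_sparsely(src, dst_path, block=4096):
    """Copy the stream src to dst_path, turning all-zero blocks into holes."""
    zero = b"\0" * block
    with open(dst_path, "wb") as out:
        while True:
            chunk = src.read(block)
            if not chunk:
                break
            if chunk == zero:
                # Seek instead of write: the skipped range becomes a hole.
                out.seek(len(chunk), os.SEEK_CUR)
            else:
                out.write(chunk)
        # If the stream ended in zeros, extend the file to its full length.
        out.truncate()
```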
> > ...you may be overestimating the time to read or md5sum all the 0's
> > in the hole of sparse files.
>
> Perhaps, but...
>
> > The hypothetical sparse_cat would improve performance, but just
> > marginally.
>
> ...it would eliminate the need for a two-pass read with tar. And if
> summing zeros is fast, why is rsync so slow in your experiments?
Well, I've demonstrated you can sum the 0's very quickly, but I don't yet
know why rsync stinks at this.
> (A literal sparse_cat (drop-in replacement for cat) wouldn't actually
> be that useful, as you need to communicate to the process receiving the
> stream the byte offset for each chunk of data, assuming you want to be
> able to reconstruct the sparse file later with the same holes. So
> practically speaking, this is something you'd have to integrate into
> tar, gzip, rsync, or whatever archiver you're using.
>
> It sounds like it would be a small project to patch tar to use the
> fcntl, as it already has a data structure figured out for recording the
> holes. But you'd still need additional hacks to do incremental
> transfers. So the bigger win would be patching rsync.)
I do plan on writing an experimental python script. Not that it'll
actually be useful, but it will at least prove the concept. And then maybe
the rsync guys will care. Don't know.
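For what it's worth, a rough sketch of what such a script could do: enumerate only the allocated extents of a file as (offset, data) pairs, which is exactly the information a sparse-aware rsync or sparse_cat would need to send. Note that the os.SEEK_DATA/os.SEEK_HOLE interface used here only became available well after this post (Python 3.3+, with kernel and filesystem support required); the function name is made up:

```python
import errno
import os

def data_extents(path):
    """Yield (offset, data) for each allocated region of the file,
    skipping the holes entirely instead of reading their zeros."""
    with open(path, "rb") as f:
        fd = f.fileno()
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as e:
                if e.errno == errno.ENXIO:  # no data past this offset
                    break
                raise
            end = os.lseek(fd, start, os.SEEK_HOLE)
            f.seek(start)
            yield start, f.read(end - start)
            offset = end
```

A receiver can rebuild the file by seeking to each offset before writing, recreating the holes without the extra read pass that tar's --sparse option costs.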
More information about the Discuss mailing list