
BLU Discuss list archive



Backing up sparse files ... VM's and TrueCrypt ... etc



Richard Pieri wrote:
> Tom Metro wrote:
>>       -S, --sparse
>>              handle sparse files efficiently
> 
> That's the switch that instructs tar to scan every file twice to
> determine sparseness.  "handle... efficiently" means "don't fill up
> the archive with nulls".

Edward Ned Harvey wrote:
> The --sparse option of tar only seems to have any effect when you're
> extracting the tarball.  It will look for files with a lot of sequential
> zeros, as it extracts them, they will be extracted sparse.
> 
> When creating a tar backup of a sparse file ...
> It appears that tar tries to read the whole file, and during the zero
> sections, the system is just generating as many zeros as the CPU can
> generate.  This is surprisingly slow compared to what you'd think, but it is
> much faster than actually reading zeros from disk.

Ah, more details here in the GNU tar documentation:

http://www.gnu.org/software/tar/manual/html_section/Compression.html#SEC128

   Files in the file system occasionally have holes. A hole in a file
   is a section of the file's contents which was never written. The
   contents of a hole reads as all zeros. On many operating systems,
   actual disk storage is not allocated for holes, but they are counted
   in the length of the file. If you archive such a file, tar could
   create an archive longer than the original. To have tar attempt to
   recognize the holes in a file, use '--sparse' ('-S'). When you use
   this option, then, for any file using less disk space than would be
   expected from its length, tar searches the file for consecutive
   stretches of zeros. It then records in the archive for the file
   where the consecutive stretches of zeros are, and only archives the
   "real contents" of the file.

     '-S'
     '--sparse'

       This option instructs tar to test each file for sparseness
       before attempting to archive it. If the file is found to be
       sparse it is treated specially, thus allowing to decrease the
       amount of space used by its image in the archive.

   However, be aware that '--sparse' option presents a serious
   drawback. Namely, in order to determine if the file is sparse tar
   has to read it before trying to archive it, so in total the file is
   read twice. So, always bear in mind that the time needed to process
   all files with this option is roughly twice the time needed to
   archive them without it.

This hardly seems like a win: you have to pass over each file twice,
first to locate the stretches of zeros, and a second time to archive
the "real contents" with those stretches compacted out. You might as
well just use a compression filter (gzip, bzip2), which can handle not
only strings of zeros but anything else that repeats, and does it on
the fly in a single pass.

This suggests that the --sparse option is effectively obsolete, given 
the modern practice of almost always compressing tar archives.
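
To put a rough number on that: a megabyte of zeros deflates to about a
kilobyte in a single pass. A quick illustration using Python's zlib
(the same DEFLATE algorithm gzip uses); a sanity check, not a
benchmark:

   import zlib

   zeros = b'\0' * (1024 * 1024)         # 1 MiB of zeros
   compressed = zlib.compress(zeros, 6)  # level 6, gzip's default
   print(len(compressed))                # on the order of 1 KB, ~1000:1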

(Surprising to see they actually expended effort as recently as 2004 and 
2005 revising the way tar stores sparse files[1].)


   This option is meaningful only when creating or updating
   archives. It has no effect on extraction.

   On extraction (using '--sparse' is not needed on extraction) any such
   files have holes created wherever the continuous stretches of zeros
   were found.

That's a little misleading. It isn't like zeros are magically treated as 
special for writes with any file. What they mean is that tar sees the 
sparse flag in the archive header, and applies special case logic when 
writing the file, such that it seeks past strings of zeros.
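
The write-side trick is simple enough: read the stream in blocks, and
when a block is all zeros, seek past it instead of writing it, which
leaves a hole. Here's a simplified sketch in Python of that general
technique (roughly what cp --sparse=always does; not tar's actual
implementation, and the 4 KB block size is an arbitrary choice):

   def write_sparse(src, dst_path, block=4096):
       """Copy src to dst_path, seeking over all-zero blocks so the
       destination ends up sparse wherever the input was zeros."""
       zeros = b'\0' * block
       with open(dst_path, 'wb') as dst:
           while True:
               buf = src.read(block)
               if not buf:
                   break
               if buf == zeros:
                   dst.seek(len(buf), 1)   # skip ahead: leaves a hole
               else:
                   dst.write(buf)
           dst.truncate()  # fix the length if the file ends in a hole

The same trick covers the gzip extraction case mentioned below: feed
it gzip.open('backup.gz', 'rb') as src and the decompressed zeros
never reach the disk.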

I guess this would be why you'd still want to use the --sparse option, 
though it would be better if GNU tar added a "detect only" variant, 
which wouldn't compress the strings of zeros, but would simply detect 
when a file's length exceeds its allocated space and flag it as sparse 
in the headers. (This could be accomplished in a single pass, though 
doing it portably would probably be challenging.)
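
The detection half of that is cheap: a single stat() call tells you
whether a file occupies fewer blocks than its length would imply. A
rough sketch in Python (st_blocks is in 512-byte units on Linux; treat
this as a heuristic, since delayed allocation or filesystem
compression could also make the numbers disagree):

   import os

   def looks_sparse(path):
       st = os.stat(path)
       # allocated space smaller than the logical size implies holes
       return st.st_blocks * 512 < st.st_size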


This tar limitation seems to imply that there isn't an OS-level API for 
efficiently determining which sectors of a sparse file are in active 
use. (I'm assuming if one existed, tar would have made use of it.) 
Perhaps this is doable, but only using file system specific APIs? (At 
some level the file system obviously needs to know.)

This also explains why a tool like gzip doesn't have a sparse option. If 
there is no API to skip the zero sections, and it already efficiently 
stores repeated zeros, then there is no need to provide special handling 
for a sparse file. (Except for extraction, which apparently requires 
piping gzip output to cp --sparse=always[2,3]. Apparently there is no 
sparse flag in gzip file headers.)


> I didn't care to explore it, due to lack of ability to do
> incrementals.
> Tar does in fact do alright at backing up such files, and restoring them.
> But only by doing a Full every time.  Not incremental.

I'd say tar doesn't bring anything useful to this problem, given the 
above info. If you only have one sparse file, you can get the same 
benefit by simply doing:

gzip -c --rsyncable sparse_file > desparsed_file

and then processing desparsed_file with rsync or another tool that can 
perform incremental backups of portions of a file.

The prior info might also explain why rsync is slow in this situation. 
In your use case the sparse file is only about 10% used, and, as you 
point out, the OS still takes time to generate all those zeros, which 
rsync then has to read and calculate MD5 hashes over, so it can take a 
while.

Would pre-processing the file with gzip help any? Is it faster to MD5 
hash zeros or to compress them? Let's try an approximation:

% dd bs=1024 count=1024 if=/dev/zero of=/tmp/zeros
1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB) copied, 0.0158239 s, 66.3 MB/s

% time md5sum /tmp/zeros
b6d81b360a5672d80c27430f39153e2c  /tmp/zeros
0.012u 0.000s 0:00.00 0.0%	0+0k 0+0io 0pf+0w

% time gzip -c /tmp/zeros > /dev/null
0.028u 0.000s 0:00.02 100.0%	0+0k 0+0io 0pf+0w

(I ran both multiple times and those numbers were the median values.) 
Looks like gzip is no help.

What you really need is a hypothetical sparse_cat that is file system 
aware and can efficiently skip over the unused sectors. Or better yet, 
the equivalent functionality built into your archiving tool.

http://lwn.net/Articles/260795/

   The process of recognizing holes is relatively primitive, though:
   about the only way to do it in a portable way is to simply look for
   blocks filled with zeroes. This technique works, but it requires
   making a pass over the data to obtain information which the lower
   levels of the system already know. It seems like there should be a
   better way.

   About two years ago, the Solaris ZFS developers proposed an extension
   to lseek() which would allow an application to find the holes in
   sparse files more efficiently. This extension works by adding two new
   "whence" options...
   [...]
   This functionality has been part of Solaris for a while; the Solaris
   developers would like to see it spread elsewhere and become something
   more than a Solaris-only extension. To that end, Josef Bacik has
   recently posted an implementation of this extension for Linux.
   Internally, it adds a new member to the file_operations structure
   (seek_hole_data()) intended to allow filesystems to efficiently
   implement the new operations. ...this patch looks relatively unlikely
   to make it into the mainline. The API is unpopular, being seen as ugly
   and as a change in the semantics of the lseek() call.

   ...there is already a proposed ioctl() command which can provide [more
   about the representation of a file than just where the holes are]. That
   interface is the FIEMAP ioctl() specified by Andreas Dilger back in
   October [2007].

If any of these APIs have made it into modern Linux kernels, creating a 
sparse_cat or patching rsync shouldn't be too tough. Who knows, rsync 
may even use them already.
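
For what it's worth, here is roughly what a minimal sparse_cat could
look like on a system where lseek() accepts the SEEK_DATA/SEEK_HOLE
whence values described above. This is an untested sketch in Python,
assuming the os module exposes those constants and the kernel and
filesystem actually implement them:

   import os, sys

   def sparse_copy(src_path, dst_path):
       src = os.open(src_path, os.O_RDONLY)
       dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
       try:
           size = os.fstat(src).st_size
           offset = 0
           while offset < size:
               try:
                   data = os.lseek(src, offset, os.SEEK_DATA)  # next data run
               except OSError:              # ENXIO: nothing but a hole left
                   break
               hole = os.lseek(src, data, os.SEEK_HOLE)  # end of that run
               os.lseek(src, data, os.SEEK_SET)  # back to the data
               os.lseek(dst, data, os.SEEK_SET)  # seek output: leaves a hole
               remaining = hole - data
               while remaining > 0:
                   chunk = os.read(src, min(remaining, 1 << 20))
                   if not chunk:
                       break
                   os.write(dst, chunk)
                   remaining -= len(chunk)
               offset = hole
           os.ftruncate(dst, size)  # preserve length across a trailing hole
       finally:
           os.close(src)
           os.close(dst)

   if __name__ == '__main__':
       sparse_copy(sys.argv[1], sys.argv[2])

The holes in the source never get read at all, so the 90%-empty file
in your example would only cost you the 10% that's actually allocated.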

(I see rsync also has a --sparse option:
   -S, --sparse
     Try to handle sparse files efficiently so they take up less space
     on the destination.

but it sounds like the objective is to improve write-side storage 
efficiency (turn on the magic to seek past zeros), not read-side. (If 
you look at the source[4], there's a write_sparse(), but no read_sparse().))


Looks like there are equivalent APIs[5] for NTFS as well. (See section 
titled "Querying The Sparse File Layout.")

All of this is academic, as it doesn't solve your problem, but it does 
shed some light as to why these common tools don't provide better 
support for sparse files.

I also ran across a page specifically listing links on the topic of 
compressing and backing up sparse VM images:
http://blog.sharevm.com/2008/12/13/virtual-machine-disk-image-compression/

And "Multiple full VM backups using VCB, rsync, OpenSSH and VSS":
http://www.tienhuis.nl/archives/23

Basically they use a VMware tool to back up the VM image, and then 
rsync that backup file.

  -Tom

1. http://www.gnu.org/software/tar/manual/html_section/Sparse-Formats.html#SEC168
2. http://www.redhat.com/archives/ext3-users/2003-January/msg00013.html
3. http://www.thelinuxsociety.org.uk/content/gnunzip-to-a-sparse-file
4. http://gitweb.samba.org/?p=rsync.git;a=blob;f=rsync.c
5. http://www.flexhex.com/docs/articles/sparse-files.phtml

-- 
Tom Metro
Venture Logic, Newton, MA, USA
"Enterprise solutions through open source."
Professional Profile: http://tmetro.venturelogic.com/





