BLU Discuss list archive

[Discuss] rsync v. cp in data migration



On Thu, May 23, 2013 at 02:41:42PM -0400, Richard Pieri wrote:
> Greg Rundlett (freephile) wrote:
> > Would  cp
> >   cp -au /source/* /target
> > be preferable to rsync?
> >   rsync -vazn --checksum --progress --stats /source/ dest/
> 
> I wouldn't use either of these as written. cp is slow (inefficient I/O
> buffering), and these rsync options don't handle sparse files and hard
> links correctly.
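
As an aside, before getting to the buffering claim: the quoted rsync
invocation includes -n, which makes it a dry run that copies nothing
at all, and --checksum, which forces a full read of every file on both
sides.  And rsync does have flags for hard links and sparse files;
something along these lines would be closer to equivalent (a sketch,
not something I've benchmarked):

  # -a  archive mode (recursive; preserves perms, times, symlinks, etc.)
  # -H  preserve hard links (not implied by -a)
  # -S  handle sparse files efficiently
  rsync -aHS --progress --stats /source/ /target/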

Inefficient buffering?  I'm fairly certain your assertion is false,
so let's test that hypothesis and get some real data.  I have only
Linux systems available for this test at the moment; I would be
interested to see how other Unix variants stack up on this same test.
I expect the results to be similar, but I've been wrong before...

First, I created a 4GB file in /tmp, the purpose of which is to
counteract the effects of buffer caching on the test.  We will read it
with dd in between copies, which should cause the kernel to replace
all of the data buffered in cache with the data from the large file.
We'll know this failed to have the intended effect if the times for
successive runs drop substantially.  I also chose to run cp first, so
that if buffer caching DID impact the test, tar would be the one to
benefit from faster copy times.
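
(On Linux there is also a more direct way to evict the page cache,
though it requires root; I mention it for anyone reproducing this
test:

  # flush dirty pages to disk, then drop the page cache, dentries
  # and inodes -- Linux-specific, must be run as root
  sync
  echo 3 > /proc/sys/vm/drop_caches

The large-file trick used below has the advantage of not needing
root.)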

-=-=-=-=-=-
$ dd if=/dev/urandom of=/tmp/large_file bs=1048576 count=4096
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 536.667 s, 8.0 MB/s
-=-=-=-=-=-
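
(Since the file exists only to evict the cache, its contents don't
matter, and reading /dev/urandom is what made creating it take nearly
nine minutes.  Something like this would be much faster and serve the
same purpose:

  # zeros are fine here; the page cache doesn't care what the bytes are
  dd if=/dev/zero of=/tmp/large_file bs=1048576 count=4096
)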

Then I made sure I had some reasonable data in my $HOME/tmp directory.
It contains a 1GB file which I created explicitly for this test, as
well as a fair number of other, smaller files.

Then I created the script shown below.  Note that for both cp and
tar, I'm using roughly equivalent options: both commands will preserve
sparse files, permissions, and so on.

-=-=-=-=-=-
$ cat cpscript.sh
#!/bin/sh

# Display the size of the data we're going to copy
du -sh tmp

echo "Clearing buffer cache..."
dd if=/tmp/large_file of=/dev/null bs=1048576 count=4096 >/dev/null 2>&1
# The dd command is very I/O intensive; sleep before we do the copy, to
# allow the system to catch up other processes that may briefly
# stampede now that it's completed
sleep 5
echo "Running cp..."
# copy, preserving permissions and file sparseness
time cp -rp --sparse=auto tmp tmp2

# rinse and repeat with tar... target dir must exist already
mkdir tmp3
echo "Clearing buffer cache..."
dd if=/tmp/large_file of=/dev/null bs=1048576 count=4096  >/dev/null 2>&1
# sleep before we do the copy, to allow the system to catch up other processes
sleep 5
echo "Running tar..."
time sh -c "tar -C tmp -cSf - . 2>/dev/null | tar -C tmp3 -xpf -"

# clean up
rm -rf tmp2 tmp3

-=-=-=-=-=-


Next, I ran the script in a loop, to get multiple data points.  I did
not use the system while the test was ongoing, so as to minimize the
effect of other processes on the test.  Here's the result:


-=-=-=-=-=-
$ for i in `seq 1 5`; do echo '===> Run #'"$i"; ./cpscript.sh; done
===> Run #1
1.2G    tmp
Clearing buffer cache...
Running cp...

real    0m36.062s
user    0m0.020s
sys     0m2.930s
Clearing buffer cache...
Running tar...

real    0m43.268s
user    0m0.110s
sys     0m3.940s
===> Run #2
1.2G    tmp
Clearing buffer cache...
Running cp...

real    0m36.673s
user    0m0.000s
sys     0m2.910s
Clearing buffer cache...
Running tar...

real    0m37.729s
user    0m0.100s
sys     0m4.070s
===> Run #3
1.2G    tmp
Clearing buffer cache...
Running cp...

real    0m36.584s
user    0m0.000s
sys     0m2.890s
Clearing buffer cache...
Running tar...

real    0m42.889s
user    0m0.080s
sys     0m4.080s
===> Run #4
1.2G    tmp
Clearing buffer cache...
Running cp...

real    0m38.155s
user    0m0.010s
sys     0m2.930s
Clearing buffer cache...
Running tar...

real    0m44.240s
user    0m0.110s
sys     0m4.020s
===> Run #5
1.2G    tmp
Clearing buffer cache...
Running cp...

real    0m35.885s
user    0m0.000s
sys     0m2.920s
Clearing buffer cache...
Running tar...

real    0m42.644s
user    0m0.100s
sys     0m4.090s
-=-=-=-=-=-

In point of fact, cp beats tar every time, and by a substantial
amount.  Averaged over the five runs, cp took about 36.7s and tar
about 42.2s on the 1.2GB data set: a difference of roughly 5.5s per
run, or about 4.6s per gigabyte.  Copy a terabyte of data and you're
talking on the order of 80 minutes of difference.

My expectation was that cp is more efficient at copying large files
than tar is--ESPECIALLY if it uses MMIO (which it appears Tom has
disproven, but still)...  MMIO generally performs better than
read()/write() copies (assuming the OS's virtual memory management is
not utterly broken), both because the latter double-buffers every I/O,
and because the former copies data directly (e.g. via memcpy()),
whereas each read() and write() suffers the overhead that all system
calls suffer, every time it's called.  That's more than enough to
offset the set-up and tear-down of the mmapped segments, and any minor
page faults incurred, by a very substantial margin.  See Stevens,
Advanced Programming in the Unix Environment, Ch. 12, p. 411.  The
only time MMIO should lose is if your system is already under very
heavy memory pressure, or if your OS's VMM is garbage.  That benefit
is lessened if your data consists of a large number of small files;
but even then the performance of MMIO should be at least roughly on
par with read()/write(), and may still be better--unless, again, your
system is broken or lacks adequate memory.

Nevertheless, as Tom points out, that seems not to be in play here.
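
(For anyone who wants to check what cp actually does under the hood,
presumably along the lines of what Tom did, strace can summarize the
system calls it makes.  A sketch, Linux-specific:

  # -c counts syscalls rather than printing each one; if cp used MMIO
  # for the file data, mmap() would dominate instead of a read()/write()
  # loop.  The dynamic loader makes a few mmap() calls at startup
  # regardless, so expect a small baseline.
  strace -c -e trace=mmap,read,write cp -rp --sparse=auto tmp tmp2
)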

Even so, I reran the test after removing the 1GB file, and frankly I
expected tar to compare more favorably, but it did not.  With that
data set, cp completed on average in about 2.95s, and tar in about
3.2s.  The results were fairly consistent across runs, except that in
one run tar did beat cp; given the other results, though, that is
probably just noise from background tasks on the system.


-- 
Derek D. Martin    http://www.pizzashack.org/   GPG Key ID: 0xDFBEAD02
-=-=-=-=-
This message is posted from an invalid address.  Replying to it will result in
undeliverable mail due to spam prevention.  Sorry for the inconvenience.



