On Thu, May 23, 2013 at 02:41:42PM -0400, Richard Pieri wrote:
> Greg Rundlett (freephile) wrote:
> > Would cp
> > cp -au /source/* /target
> > be preferable to rsync?
> > rsync -vazn --checksum --progress --stats /source/ dest/
>
> I wouldn't use either of these as written. cp is slow (inefficient I/O
> buffering), and these rsync options don't handle sparse files and hard
> links correctly.

Inefficient buffering? I'm fairly certain your assertion is false, so
let's test that hypothesis and get some real data. I have only Linux
systems available for this test at the moment... I would be interested
to see how other Unix variants stack up with this same test. I expect
the results to be similar, but I've been wrong before...

First, I created a 4GB file in /tmp, the purpose of which is to
counteract the effects of buffer caching on the test. We will read it
using dd in between copies, which should cause the kernel to replace
all of the data buffered in cache with the data from the large file.
We'll know this has failed to have the intended effect if the times for
successive runs drop substantially. I also chose to run cp first, so
that if buffer caching DID impact the test, tar would benefit from
faster copy times.

-=-=-=-=-=-
$ dd if=/dev/urandom of=/tmp/large_file bs=1048576 count=4096
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 536.667 s, 8.0 MB/s
-=-=-=-=-=-

Then I made sure I had some reasonable data in my $HOME/tmp directory.
It contains a 1GB file which I created explicitly for this test, as
well as a fair number of other, smaller files. Then I created the
script shown below. Note that for both cp and tar, I'm using roughly
equivalent options: both commands will preserve sparse files,
permissions, etc.

-=-=-=-=-=-
$ cat cpscript.sh
#!/bin/sh

# Display the size of the data we're going to copy
du -sh tmp

echo "Clearing buffer cache..."
dd if=/tmp/large_file of=/dev/null bs=1048576 count=4096 >/dev/null 2>&1

# The dd command is very I/O intensive; sleep before we do the copy, to
# allow the system to catch up other processes that may briefly
# stampede now that it's completed
sleep 5

echo "Running cp..."
# copy, preserving permissions and file sparseness
time cp -rp --sparse=auto tmp tmp2

# rinse and repeat with tar... target dir must exist already
mkdir tmp3

echo "Clearing buffer cache..."
dd if=/tmp/large_file of=/dev/null bs=1048576 count=4096 >/dev/null 2>&1

# sleep before we do the copy, to allow the system to catch up other processes
sleep 5

echo "Running tar..."
time sh -c "tar -C tmp -cSf - . 2>/dev/null | tar -C tmp3 -xpf -"

# clean up
rm -rf tmp2 tmp3
-=-=-=-=-=-

Next, I ran the script in a loop, to get multiple data points. I did
not use the system while the test was ongoing, so as to minimize the
effect of other processes on the test. Here's the result:

-=-=-=-=-=-
$ for i in `seq 1 5`; do echo '===> Run #'"$i"; ./cpscript.sh; done
===> Run #1
1.2G    tmp
Clearing buffer cache...
Running cp...
real    0m36.062s
user    0m0.020s
sys     0m2.930s
Clearing buffer cache...
Running tar...
real    0m43.268s
user    0m0.110s
sys     0m3.940s
===> Run #2
1.2G    tmp
Clearing buffer cache...
Running cp...
real    0m36.673s
user    0m0.000s
sys     0m2.910s
Clearing buffer cache...
Running tar...
real    0m37.729s
user    0m0.100s
sys     0m4.070s
===> Run #3
1.2G    tmp
Clearing buffer cache...
Running cp...
real    0m36.584s
user    0m0.000s
sys     0m2.890s
Clearing buffer cache...
Running tar...
real    0m42.889s
user    0m0.080s
sys     0m4.080s
===> Run #4
1.2G    tmp
Clearing buffer cache...
Running cp...
real    0m38.155s
user    0m0.010s
sys     0m2.930s
Clearing buffer cache...
Running tar...
real    0m44.240s
user    0m0.110s
sys     0m4.020s
===> Run #5
1.2G    tmp
Clearing buffer cache...
Running cp...
real    0m35.885s
user    0m0.000s
sys     0m2.920s
Clearing buffer cache...
Running tar...
real    0m42.644s
user    0m0.100s
sys     0m4.090s
-=-=-=-=-=-

In point of fact, cp beats tar every time, and by a substantial amount:
more than 5 seconds per gigabyte. Copy a terabyte of data and you're
talking around 90 minutes of difference (5+ seconds/GB x 1024 GB works
out to 85+ minutes).

My expectation is that cp is more efficient at copying large files than
tar is--ESPECIALLY if it uses MMIO (which it appears Tom has disproven,
but still)... MMIO generally performs better than read()/write() copies
(assuming the OS's virtual memory management is not utterly broken)
because the latter will double-buffer every I/O, AND because the former
uses direct memory access (e.g. memcpy()), whereas each read() and
write() suffers the overhead that all system calls suffer, every time
they're called. That's more than enough to offset the set-up and
tear-down of the mmapped segments, and any minor page faults incurred,
by a very substantial margin. See Stevens, Advanced Programming in the
Unix Environment, Ch 12, p.411.
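As an aside, the question of whether a given cp binary actually uses
MMIO is easy to settle empirically: a syscall trace shows whether the
copy is done with mmap() or with a plain read()/write() loop. Here is a
rough sketch of how one might check (this assumes strace is installed;
"bigfile", "/tmp/bigfile.copy", and "/tmp/cp.trace" are just
placeholder names):

-=-=-=-=-=-
# Trace every system call cp makes while copying one large file.
# "bigfile" and the destination path are placeholders.
$ strace -o /tmp/cp.trace cp -p --sparse=auto bigfile /tmp/bigfile.copy

# Count the read()/write() calls versus the mmap() calls in the trace.
$ grep -cE '^(read|write)\(' /tmp/cp.trace
$ grep -c '^mmap(' /tmp/cp.trace
-=-=-=-=-=-

A small, fixed handful of mmap() calls is just the dynamic loader
pulling in shared libraries; a copy done via MMIO would instead show
mmap() calls that grow with the size of the file, and comparatively few
read()/write() calls. Different cp implementations and versions may use
other primitives entirely, so the trace is worth eyeballing rather than
just counting.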
The only time MMIO should lose is if your system is already under very
heavy memory pressure, or if your OS's VMM is garbage.

That benefit is lessened if your data consists of a large number of
small files; but the performance of MMIO should at least be roughly on
par with read()/write() in that case, and may still be better. Unless,
again, your system is broken or lacks adequate memory. Nevertheless, as
Tom points out, that does not seem to be in play here.

Even so, I reran the test after removing the 1GB file, and frankly I
expected tar to compare more favorably, but it did not. With that data,
cp completed on average in about 2.95s, and tar completed in about
3.2s. The results were fairly consistent across runs, except that in
one run tar did beat cp; but given the other results, this is probably
explained away as random noise from the system running background
tasks.

--
Derek D. Martin    http://www.pizzashack.org/    GPG Key ID: 0xDFBEAD02
-=-=-=-=-=-
This message is posted from an invalid address. Replying to it will
result in undeliverable mail due to spam prevention. Sorry for the
inconvenience.