BLU Discuss list archive

[Discuss] rsync v. cp in data migration

Subject: [Discuss] rsync v. cp in data migration
From: alex at pennace.org (Alex Pennace)
Date: Fri, 24 May 2013 23:29:46 -0400
In-reply-to: <CA+V8C2jswDPvQa5Ba8PXUDBfC_5H7Y1+cguy=UuatJjScWfKMA@mail.gmail.com>
References: <CA+V8C2jswDPvQa5Ba8PXUDBfC_5H7Y1+cguy=UuatJjScWfKMA@mail.gmail.com>

On Fri, May 24, 2013 at 08:47:37PM -0400, Steve Harris wrote:
> 1) Using a tar pipeline will (should) always be slower than a single
> process (e.g., cp, cpio -p, rsync), because of the overhead of the two
> processes and the system buffering for the pipe.

Not necessarily. Earlier in this thread, someone mentioned the
sendfile(2) system call in Linux. sendfile has traditionally been
limited to sending data out via a socket (before Linux 2.6.33, out_fd
had to be one). The more versatile solution to the
problem of throwing data is splice(2). If I am reading the tea leaves
correctly, two splice flows with a pipe in the middle Do The Right
Thing (the kernel ends up copying data directly from the fd_in of the
first splice call to the fd_out of the second).

Let's say we modify tar to use splice extensively. Recall that a tar
archive is a stream of headers and file data; the header for each
archive member specifies the length of that member. Our intrepid
sysadmin does this to move a lot of files around:

  $ tar -cf - . | (cd /newdir && tar -xf -)

* The reading tar process gets the size of the next archive member
  (via stat) and writes a header to standard output (the pipe's write
  end, file descriptor 1)
* The reading tar process calls splice, with fd_in set to the file's
  file descriptor, fd_out set to 1, and len set to the file size.
* The reading tar process writes out enough '\0' bytes to pad the
  member to a multiple of 512 bytes, then repeats for the next file.
* The writing tar process reads the header from its standard input
  (the pipe's read end, file descriptor 0) and learns the size of the
  incoming archive member.
* The writing tar process calls splice, with fd_in set to 0, fd_out
  set to the file's file descriptor, and len set to the file size.
* The writing tar process reads the filler '\0' bytes and discards
  them.

In this way, there is no userspace copying of file data at all.

The big drawback to splice(2) is one of its ends must be a pipe. Our
modified tar will have to take care to employ it only when it's dealing
with a pipe (on the other hand, GNU tar already does an fstat on its
output to check to see if it is going to /dev/null).
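The pipe check itself is cheap. A minimal sketch in Python, where is_pipe and may_use_splice are hypothetical helper names for this example and not anything GNU tar provides:

```python
import os
import stat


def is_pipe(fd):
    """Return True if fstat reports that fd is a FIFO (pipe)."""
    return stat.S_ISFIFO(os.fstat(fd).st_mode)


def may_use_splice(fd_in, fd_out):
    """splice(2) needs at least one end to be a pipe."""
    return is_pipe(fd_in) or is_pipe(fd_out)
```

A tar modified along these lines would presumably fall back to ordinary read and write whenever may_use_splice returns False.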

It remains to be seen if the Linux kernel will ever offer a
splice-like system call that handles the general case. In the
meantime, user-space processes desiring a general approach could
employ two threads each with a splice call between an intra-process
pipe. One wonders why no one has come up with a cp that does just that.

-- 
Alex Pennace, alex at pennace.org, http://osiris.978.org/~alex/

Follow-Ups:
- [Discuss] rsync v. cp in data migration
  - From: bogstad at pobox.com (Bill Bogstad)

References:
- [Discuss] rsync v. cp in data migration
  - From: vsharris at gmail.com (Steve Harris)


Boston Linux & Unix / webmaster@blu.org