Gentoo Wiki

TIP_Fast_Copy

This article is part of the Tips & Tricks series.


The normal way

To copy files recursively from src1/ and src2/ to dest/, run:

cp -Rv src1/ src2/ dest/

The piped and pax way

cp reads and writes each file in turn within a single process. With the kernel's pipe support, one process can read the sources while a second process writes the destination. tar converts the directories recursively into a single stream: at one end we create the stream from the source directories, at the other end we extract it into the destination directory, and a pipe carries the data between the two.

tar -cSf - src1/ src2/ | tar -C dest/ -xf -

tar is more flexible than cp and may also be much faster. You can add the -v flag to the right-hand tar (the one with -x) to make it print each file as it is extracted, but the terminal output may then limit throughput.
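
As a quick sanity check, the pipeline can be tried on a throwaway tree and the result compared with diff. This is only a sketch: the directory names and file contents are invented for the demonstration, and GNU tar is assumed.

```shell
#!/bin/sh
# Sketch: copy two source trees through a tar pipe and verify the result.
# All names below (src1, src2, dest, the file contents) are made up for the demo.
work=$(mktemp -d)
mkdir -p "$work/src1/sub" "$work/src2" "$work/dest"
echo "alpha" > "$work/src1/a.txt"
echo "beta"  > "$work/src1/sub/b.txt"
echo "gamma" > "$work/src2/c.txt"

# Run from inside $work so the archive holds relative paths (src1/..., src2/...).
(cd "$work" && tar -cSf - src1/ src2/ | tar -C dest/ -xf -)

# diff -r exits 0 only when both trees have identical contents.
if diff -r "$work/src1" "$work/dest/src1" && diff -r "$work/src2" "$work/dest/src2"; then
    copy_status="identical"
else
    copy_status="different"
fi
echo "copy result: $copy_status"
```

The temporary directory is left in place so the copy can be inspected; remove it with rm -rf "$work" when done.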


Note that ext2, ext3 and most modern Linux filesystems support 'sparse files', which store large files containing long runs of zeros efficiently. Most of the time you need not care about this, but certain programs make extensive use of the feature (e.g. net-p2p/mldonkey). Be careful when copying sparse files using this method, as your disk usage can explode. Whereas the cp utility handles sparse files automatically, tar does not: sparse files in the source directory will be written out in full in the destination directory unless the -S flag is used on the left-hand side of the pipe.
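
The effect of -S can be observed by comparing the apparent size of a file with the space it occupies on disk. The following is a sketch under stated assumptions: GNU tar, a truncate command, and a filesystem that supports sparse files. The file name sparse.img is made up.

```shell
#!/bin/sh
# Sketch: show that tar -S keeps a sparse file sparse across the pipe.
# Assumes GNU tar and truncate(1); the file name is hypothetical.
work=$(mktemp -d)
mkdir -p "$work/src" "$work/dst"

# A 16 MiB file: 4 bytes of real data followed by one large hole.
printf 'data' > "$work/src/sparse.img"
truncate -s 16777216 "$work/src/sparse.img"

# -S on the creating side detects the hole and records it in the archive.
tar -cSf - -C "$work/src" . | tar -xf - -C "$work/dst"

# Apparent size (bytes) versus space actually allocated (KiB).
src_bytes=$(wc -c < "$work/src/sparse.img" | tr -d ' ')
dst_bytes=$(wc -c < "$work/dst/sparse.img" | tr -d ' ')
src_kb=$(du -k "$work/src/sparse.img" | cut -f1)
dst_kb=$(du -k "$work/dst/sparse.img" | cut -f1)
echo "source: $src_bytes bytes apparent, $src_kb KiB on disk"
echo "copy:   $dst_bytes bytes apparent, $dst_kb KiB on disk"
```

Dropping the -S from the left-hand tar and rerunning should show the on-disk size of the copy growing towards the apparent size.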

pax can also be used, with the same result:

1. cd into the source directory.
2. Check that the destination exists.
3. Run the following command:

pax -rw . destination_directory

It is also easier to remember.

Network with SSH

tar czvf - ListOfFiles | ssh remote.box.com tar xzf - -C /home/user/PathToCopy
tar czvf - ListOfFiles | ssh -c blowfish remote.box.com tar xzf - -C /home/user/PathToCopy
ssh remote.box.com tar czf - BeginDirCopyFiles | tar xzf - -C DirToCopy

Remote to local:

scp -C remotebox:path/to/sourcefile .

Local to remote:

scp -C localfiles remotebox:path/to/destination/

cpio can also be used, with the same result:

ssh remotebox "cd /etc/ ; ls -1 issue motd |cpio -oH odc" >archive.cpio

(The cd into /etc/ makes the archived paths relative.)

To extract this cpio archive:

cpio -iv <archive.cpio

Note: To retrieve files from HP-UX, the compatible arguments are "cpio -oc".

Network with netcat

Transfers over netcat need very little CPU, unlike transfers over ssh. However, the data is transmitted without encryption or authentication. Further disadvantages: (1) you have to open a port on the destination system's firewall and close it afterwards; (2) you have to enter commands on both the source and the destination machine; (3) for security reasons, netcat should not be installed on a critical system. The listening syntax depends on the netcat variant: traditional netcat takes the port as -l -p 2342, while the OpenBSD variant takes -l 2342.

Destination box: nc -l -p 2342 | tar -C /target/dir -xzf -
Source box: tar -czf - /source/dir | nc Target_Box 2342

To reduce CPU use further, lzop can replace tar's z option; it compresses much faster, but less effectively.

Destination box: nc -l 2342 | lzop -d | tar -C /target/dir -xf -
Source box: tar -cf - /source/dir | lzop | nc Target_Box 2342

Network with socat

Same idea as Netcat.

Destination box: socat -u tcp4-listen:2342 - | tar xf - -C /target/dir
Source box: tar cf - /source/dir | socat -u - tcp4:Target_Box:2342

We can use a variety of compression methods in a general way:

Destination box: socat -u tcp4-listen:2342 - | ${UNZIP} | tar xf - -C /target/dir
Source box: tar cf - /source/dir | ${ZIP} | socat -u - tcp4:Target_Box:2342

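We then define ZIP as one of the programs measured below (cat, lzop, gzip --fast, gzip, gzip --best or bzip2)

and UNZIP as the matching decompressor (cat, lzop -d, gzip -d or bzip2 -d).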
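
The ZIP/UNZIP pattern can be tried locally before involving two machines. The sketch below replaces the socat hop with a plain pipe, which is an assumption made purely for the demonstration; the directory layout and file contents are made up, and gzip stands in for whichever compressor you pick.

```shell
#!/bin/sh
# Sketch: exercise the ${ZIP}/${UNZIP} pattern locally, without socat.
# ZIP and UNZIP are left unquoted on purpose so "gzip --fast" splits into
# a command and its argument (POSIX sh and bash do this; zsh does not by default).
ZIP="gzip --fast"
UNZIP="gzip -d"

work=$(mktemp -d)
mkdir -p "$work/source/dir" "$work/target"
echo "hello" > "$work/source/dir/file.txt"

# Same shape as the two boxes, with the network hop replaced by a plain pipe.
tar cf - -C "$work/source" dir | ${ZIP} | ${UNZIP} | tar xf - -C "$work/target"

roundtrip=$(cat "$work/target/dir/file.txt")
echo "received: $roundtrip"
```

Swapping the two variables (for example ZIP="bzip2" and UNZIP="bzip2 -d") changes the compressor without touching the pipeline itself.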

Each of these compression programs makes a different tradeoff between compression ratio and CPU time, so the best choice depends on your processor, hard drive, and network speeds. The following results should give you a better feel for which program to use in a given situation:

# load the files into the block cache
# (piped through cat: GNU tar skips reading file data when its output is /dev/null)
tar cf - /usr/src/linux-2.6.3 | cat > /dev/null
# determine compression ratio
tar cf - /usr/src/linux-2.6.3 | ${ZIP} > /tmp/foo; ls -sh /tmp/foo
# determine cpu time
time -p tar cf - /usr/src/linux-2.6.3 | ${ZIP} | cat > /dev/null

program        output size   time
cat            182MB         1.1sec
lzop           64MB          4sec
gzip --fast    51MB          9sec
gzip           41MB          18sec
gzip --best    41MB          49sec
bzip2          32MB          134sec
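
To repeat the size comparison on your own data, a small loop-style script is enough. This is a rough sketch, not the procedure used for the figures above: the sample data is synthetic, the helper name measure is invented here, and only cat and gzip are measured because lzop and bzip2 may not be installed. Add them to the list if they are.

```shell
#!/bin/sh
# Sketch: compare compressed sizes for a few candidate ZIP commands.
# The sample data is synthetic; real results depend on your files.
work=$(mktemp -d)
mkdir -p "$work/sample"
seq 1 200000 > "$work/sample/numbers.txt"   # highly compressible text

# Print the number of bytes the pipeline produces for a given compressor.
# (measure is a hypothetical helper; $1 is split into command and arguments on purpose.)
measure() {
    tar cf - -C "$work" sample | $1 | wc -c | tr -d ' '
}

cat_bytes=$(measure "cat")
fast_bytes=$(measure "gzip --fast")
best_bytes=$(measure "gzip --best")

printf '%-12s %10s\n' "program" "bytes"
printf '%-12s %10s\n' "cat" "$cat_bytes"
printf '%-12s %10s\n' "gzip --fast" "$fast_bytes"
printf '%-12s %10s\n' "gzip --best" "$best_bytes"
```

For timing, wrap the same pipeline in time -p as in the commands above; the sizes alone only tell half of the tradeoff.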

Insanely fast with rsync

rsync -a source <user>@<host>:<somedir>
rsync -a <user>@<host>:<somedir> <localdir>

--question for clarification: why do you say rsync copies are so insanely fast? I noticed them to be very slow on the first (creating) run ...
--answer: rsync compares the files on both computers (e.g. by last-modified timestamp) and only copies a file if it has changed. That's also why your Portage tree update (emerge --sync) is as fast as it is: rsync only downloads the packages that have changed. Why else would it be named rsync (synchronize)? ;-)
--answer2: rsync is slower than netcat or socat on the first run. rsync only gains its speed from its file-compare algorithm _after_ it has been run once.

--note: Use tar over ssh to begin with (if you need the initial run to be fast), then use rsync to keep the copy up to date. rsync doesn't keep and doesn't need metadata to work, so there is no prerequisite for the files to have been transported initially using rsync.

In other words, rsync is only faster when there is some data that can be skipped. A fresh copy will always be slower.
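
The behaviour described in this discussion can be observed with two local directories. The sketch below is an illustration only: it uses local paths instead of <user>@<host>, the directory and file names are made up, and sync_count is a helper invented here that counts the files rsync reports as transferred (the -i lines beginning with ">f").

```shell
#!/bin/sh
# Sketch: show that a repeated rsync run only transfers what has changed.
# Uses two local directories instead of a remote host; names are hypothetical.
work=$(mktemp -d)
mkdir -p "$work/source" "$work/mirror"
echo "one" > "$work/source/a.txt"
echo "two" > "$work/source/b.txt"

# Count the files rsync reports as transferred.
# "|| true" keeps the exit status clean when grep finds no matching lines.
sync_count() {
    rsync -ai "$work/source/" "$work/mirror/" | grep -c '^>f' || true
}

if command -v rsync >/dev/null 2>&1; then
    first_run=$(sync_count)      # fresh copy: every file is sent
    second_run=$(sync_count)     # nothing changed: nothing is sent
    echo "changed" >> "$work/source/a.txt"
    third_run=$(sync_count)      # only the modified file is sent
    echo "files transferred: $first_run, then $second_run, then $third_run"
else
    echo "rsync is not installed; skipping"
fi
```

The first run is the "fresh copy" case from the discussion; the later runs are where rsync saves time.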
Retrieved from "http://www.gentoo-wiki.info/TIP_Fast_Copy"

Last modified: Fri, 05 Sep 2008 23:33:00 +0000 Hits: 54,925