Dnaiel

Reputation: 7832

Fastest way to gzip files from one filesystem to another

I need to transfer 2,000 files (30 TB of uncompressed data, reduced to ~8 TB when compressed with gzip) from filesystem 1 to filesystem 2 over a link with a bandwidth of 100 MB/sec.

Is there a command such that I can write the gzipped files to the new filesystem directly so I don't have to transfer the 30 TB of data but rather just copy the gzipped files to the new system?

Would this command work, or are there other alternatives?

gzip -c /my/dir/foo.txt > /my/new/filesystem/foo.txt.gz

In other words, this command would only copy the compressed .gz file across, not the whole file, am I right? So in /my/new/filesystem/ my files would use about 1/3 of the space they use in the original /my/dir/?

The data is on a high performance cluster, so I can transfer the files in parallel, but I am unsure how many cores to use. If I use 2,000 cores I probably won't gain much speed, since the processors can compress faster than the 100 MB/sec bandwidth allows anyway.
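As a rough check on the numbers above, here is the arithmetic as a small shell sketch. It assumes decimal units (1 TB = 1,000,000 MB) and a link that sustains the full 100 MB/sec; `transfer_hours` is just an illustrative helper name, and the results are rounded down to whole hours.

```shell
# Estimate transfer time in whole hours (integer division, rounds down).
# Assumes decimal units: 1 TB = 1,000,000 MB.
transfer_hours() {
    tb=$1 mb_per_s=$2
    echo $(( tb * 1000000 / mb_per_s / 3600 ))
}

transfer_hours 30 100   # uncompressed data: prints 83
transfer_hours 8 100    # gzipped data:      prints 22
```

So under these assumptions the link alone costs roughly 3.5 days for the raw data versus a bit under one day for the gzipped data.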

I am looking for the right gzip command and a good parallelization strategy to transfer the data as fast as possible.

Note 1. The new server (filesystem) is connected to the Cluster and talks to the old filesystem through a 100 MB/sec connection. What I refer to as the Cluster is a computing center that can run multiple jobs in parallel (more detailed info in note 2). The new server where I am transferring the data to (i.e., what I call the new filesystem) is a Dell Server, PE R515 with up to 12 Hot Swap HDDs and 2 Cabled Hard Drives, LED and AMD Opteron 42XX Procs, 4TB 7.2K RPM Near-Line SAS 6Gbps 3.5in Hot-plug Hard Drive. More info here: http://mindmeeting.blogspot.com/2014/01/server-information.html. The OS is CentOS 6.

Note 2. This is as much info as I have about the Cluster architecture. The original cluster was built from 512 Dell PowerEdge M600 blades distributed in 32 M1000 chassis, each with dual Xeon E5410 2.3 GHz quad core processors, for a total of 4096 cores. Each of these nodes hosts 32 GB RAM and both DDR Infiniband and Gb ethernet connectivity. It has since been expanded to the architecture below with the addition of dedicated access, interactive, specialty, and service systems as well as several additional groups of compute nodes. The cluster image is based on RHEL 5, and shared storage is hosted on several NFS servers (e.g. home directories) and two Lustre instances (high performance scratch and data respectively).

Upvotes: 0

Views: 515

Answers (1)

Mark Setchell

Reputation: 207668

Some thoughts:

1) I would benchmark "rsync" with compression as it is restartable. You can also do multiple "rsyncs" in parallel.

2) Also, are the disks attached to a SAN? Can you mount the new filesystem to the existing host and then unmount and remount to the new host?

3) Also, never underestimate the bandwidth of a lorry full of tapes! If you have LTO available, it has a higher bandwidth than your network.
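To make point 1 concrete, here is a minimal sketch of running several "rsyncs" side by side. It assumes GNU find and xargs, a flat source directory, and that the destination is something rsync accepts (a local directory, or a hypothetical user@newserver:/path). The function name parallel_rsync and the default of 4 jobs are placeholders. Note that -z only compresses the data on the wire, so the files land uncompressed at the destination, and it only helps when rsync actually talks to a remote host rather than to a mounted filesystem.

```shell
# Copy every regular file in SRC to DEST, JOBS rsync processes at a time.
# --partial keeps partially transferred files, which helps when a run is restarted.
parallel_rsync() {
    src=$1 dest=$2 jobs=${3:-4}
    find "$src" -maxdepth 1 -type f -print0 |
        xargs -0 -P "$jobs" -n 1 sh -c 'rsync -az --partial "$2" "$1"/' sh "$dest"
}
```

Usage would look like `parallel_rsync /my/dir user@newserver:/my/new/filesystem 8`, where the host name is made up. Benchmark a few job counts on a subset of the files before committing to one.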

Some additional information about your system would be helpful - e.g. SAN, infrastructure, distance between servers, whether you could temporarily add network interfaces...
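As for the gzip command in the question: `gzip -c file > /new/fs/file.gz` does write only compressed bytes to the new filesystem, provided gzip runs on a machine on the old-filesystem side of the 100 MB/sec link (that placement is an assumption about your setup). A minimal sketch of doing this for a whole directory in parallel follows. It assumes GNU find and xargs and a flat source directory; gzip_tree and the default of 4 jobs are illustrative names and values, not a recommendation.

```shell
# Compress every regular file in SRC into DEST as NAME.gz, JOBS at a time.
# Each file is written to NAME.gz.part first and renamed when complete,
# so an interrupted run can be restarted without trusting half-written files.
gzip_tree() {
    src=$1 dest=$2 jobs=${3:-4}
    find "$src" -maxdepth 1 -type f -print0 |
        xargs -0 -P "$jobs" -n 1 sh -c '
            dest=$1; f=$2
            out=$dest/$(basename "$f").gz
            # Skip files that were already compressed by an earlier run.
            [ -e "$out" ] && exit 0
            gzip -c "$f" > "$out.part" && mv "$out.part" "$out"
        ' sh "$dest"
}
```

Usage: `gzip_tree /my/dir /my/new/filesystem 16`. Once the combined compressed output reaches the link speed, adding more jobs will not help, so I would start with a small number, watch the network throughput, and increase it only while throughput keeps rising.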

Upvotes: 2
