Reputation: 97
We have a data center with a 10G Direct Connect circuit to AWS. In the data center, we have an IBM XIV storage infrastructure with GPFS filesystems containing 1.5 BILLION images (about 50 KB each) in a single top-level directory. We could argue all day about how dumb this was, but I'd rather seek advice on my task, which is moving all these files into an S3 bucket.
I can't use any physical transport solution, as the data center is physically locked down and obtaining on-premises physical clearance is a 6-month process.
What is the best way to do this file migration?
The best idea I have so far is to build an EC2 Linux server in AWS, mount the destination S3 bucket on it as a filesystem using s3fs-fuse (https://github.com/s3fs-fuse/s3fs-fuse/wiki/Fuse-Over-Amazon), and then run a netcat + tar pipeline between the data center server holding the GPFS mount and the EC2 server. I found this suggestion in another post:

Destination box: nc -l -p 2342 | tar -C /target/dir -xzf -
Source box: tar -cz /source/dir | nc Target_Box 2342
Before I embark on a task that could take a month, I wanted to see if anyone here had a better way to do this?
Upvotes: 3
Views: 936
Reputation: 8836
Alternatively, you could get a Snowball device with 50 TB of storage, load the data onto it, and ship it back to AWS via UPS delivery truck. http://aws.amazon.com/importexport/
Upvotes: 1
Reputation: 179124
If you're good with a month, what you're contemplating might work... but there are pitfalls along that path.
To explain those, I need to get a little bit philosophical.
When faced with a resource-intensive job that you want to optimize, it's generally best to figure out which of your limited resources is the best one to push to its limit, and then make certain all of the other resources are sufficient to let that happen. Otherwise, you can end up pushing one resource against an artificial and unnecessary limit.
In 1 millisecond, a 10 Gbit/s link can transfer 10 Mbits. Every millisecond you waste not transferring data increases the runtime of the job by that much more. So, you need to keep the data flowing... and your solution will not accomplish that.
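For a rough sense of scale (a back-of-envelope figure of mine, assuming ~50 KB per object and ignoring protocol overhead):

1,500,000,000 objects x 50 KB ≈ 75 TB
75 TB / 10 Gbit/s (≈ 1.25 GB/s) ≈ 60,000 s ≈ 17 hours at line rate

In other words, the raw transfer need not take anywhere near a month; per-object overhead and serialization are what will stretch it out.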
S3 can easily handle 100 uploads per second, which is 1 upload every 10 ms if they happen sequentially... and s3fs is unlikely to be able to keep that pace. During each of those 10 ms you could have been pushing 100 Mbits across your link... but you didn't. You managed one 50 KB object, or less. While s3fs is indisputably very cool -- I use it in one application for production back-end systems -- it's also sort of the most theoretically incorrect way to use S3 that actually works, because it tries to treat S3 like a filesystem and expose it to the operating system with filesystem semantics... while S3 is an object store, not a filesystem, and there is an "impedance gap" between the two.
The artificial choke point here will be s3fs, which will only be allowing tar to extract one file at any given instant. The output of tar will repeatedly block for some number of micro- or milliseconds waiting for s3fs on each object, which will block tar's input from the network, which will block the TCP connection, which will block the source tar... meaning that you won't actually be maximizing the use of any of your real resources, because you're hitting an unnecessary limit.
Never mind what happens if s3fs encounters an error. Depending on the nature of the error...
tar: broken pipe
D'oh.
What you really need is concurrency. Push those files into S3 in parallel as fast as S3 will take them.
Your best bet for this will be code running in the private data center. Split the list of files into several chunks and spawn multiple independent processes (or threads), each handling one chunk: reading from disk and uploading to S3.
If I were doing this (as, indeed, I have done) I'd write my own code.
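A minimal sketch of that chunk-and-spawn approach, in shell rather than real code (assumptions of mine, not from the question: the GPFS mount is at /gpfs/images, the aws CLI is installed and credentialed on the data center server, and 8 workers is merely a starting point):

# Build the file list once (with 1.5 billion entries this is itself a big,
# slow step), split it into 8 chunks without breaking lines, and run one
# backgrounded upload loop per chunk. In practice you would also filter
# out . and .. from the unsorted -f listing.
ls -1 -f /gpfs/images > /var/tmp/filelist
split -n l/8 /var/tmp/filelist /var/tmp/chunk.

for c in /var/tmp/chunk.*; do
  (
    while read -r f; do
      aws s3 cp "/gpfs/images/$f" "s3://bucket-name/$f"
    done < "$c"
  ) &
done
wait   # block until every worker loop has exited

Forking one aws process per 50 KB object is still a lot of per-file overhead, which is part of why writing your own code, keeping one process and a reused pool of HTTP connections per worker, pays off here.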
You could, however, fairly easily accomplish this using the aws CLI's aws s3 cp command in conjunction with GNU parallel, which can be configured to behave in a manner similar to xargs -- each of "n" parallel invocations of aws s3 cp being directed to copy a list of files that parallel builds from stdin and passes in on the command line.

Untested, but on the right track... cd into the directory of the files, and then:

$ ls -1 -f | parallel --eta -m aws s3 cp {} s3://bucket-name

ls -1 -f lists the files in the directory, one per line, names only, unsorted, with the output piped to parallel.

--eta estimates the remaining runtime based on progress so far.

-m means replace {} with as many input arguments as possible without exceeding the shell's limit on command-line length.

See the docs for GNU parallel for other options, such as log files, error handling, and controlling the number of parallel processes to spawn (which defaults to the number of cores on the machine where this runs). As long as you have free processor capacity and memory, you may want to run 2x, 3x, or 4x as many parallel jobs as you have cores, because otherwise the processors will spend a lot of time waiting on network I/O.
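A hedged sketch of what a more controlled run might look like (the job count, log path, and bucket name here are placeholder assumptions, not values from the question):

$ ls -1 -f | parallel --eta --jobs 32 --joblog /var/tmp/s3-upload.log --retries 3 \
      aws s3 cp {} s3://bucket-name/

--jobs 32 caps the number of simultaneous uploads, --joblog writes one line per completed job (including its exit status) so you can audit a partial run (and resume it with --resume-failed), and --retries 3 re-runs any failed upload up to three times. Note that aws s3 cp itself expects a single source argument, so invoking it once per file as above (i.e., without -m) may be the safer variant, at the cost of one CLI process per object.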
Upvotes: 6