Reputation: 11
I am currently trying to migrate about a petabyte of data from a local SMB share to AWS S3. The problem I am having is the loss of the original file-creation metadata when using DataSync or copying the files to the S3 bucket. I need the files in an object storage configuration for batch analytics.
I have one solution - use the AWS CLI to move the files and create user-defined metadata.
But I don't think this will work at scale.
Other suggestions have been to migrate the data to AWS FSx and then use AWS Lambda to move it to the bucket.
Any help or suggestions would be awesome.
Upvotes: 0
Views: 463
Reputation: 16304
I have one solution - [...] to move and create user defined metadata
With the removal of the "CLI" part, I agree. S3 is object storage, not a filesystem, so you'll have to add your own metadata to the objects to carry any further information, such as the original timestamps.
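As a minimal sketch of that idea: the helper below (stdlib only; the function name and metadata keys are my own invention, not a standard) captures a file's timestamps as the string key/value pairs S3 expects for user-defined metadata. You would pass the resulting dict as `Metadata=` to boto3's `put_object`, or as `ExtraArgs={"Metadata": ...}` to `upload_file`.

```python
import os
from datetime import datetime, timezone

def s3_user_metadata(path: str) -> dict:
    """Build user-defined metadata preserving the source file's attributes.

    S3 stores these as x-amz-meta-* headers on the object. All values
    must be strings. The key names here are illustrative only.
    """
    st = os.stat(path)

    def iso(ts: float) -> str:
        return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()

    return {
        "src-mtime": iso(st.st_mtime),  # last modification time
        "src-ctime": iso(st.st_ctime),  # change time (creation time on Windows/SMB)
        "src-size": str(st.st_size),    # original size in bytes
    }
```

Note that user-defined metadata must be set at upload time (or via a copy-in-place), which is exactly why it is worth getting right before the petabyte goes over the wire.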
I don't think [AWS CLI] will work at scale.
For a project of this magnitude you will need an idempotent, performant solution that can be scaled out across at least one machine.
migrate about a petabyte of data
A petabyte is a lot of data. You should first run some numbers to see how quickly you can upload 1,000,000 GB on your current uplinks. Gigabit is ideally 125 MB/s, so a petabyte takes 8,000,000 seconds, or about 92.6 days on Gigabit. Do you have multi-gigabit uplinks? Or are you willing to slowly and continually do this over roughly three months?
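The arithmetic above is worth parameterizing before committing to a plan. A small sketch (function name and the `efficiency` knob are my own; real-world throughput will be below the raw line rate):

```python
def transfer_days(bytes_total: float, link_bits_per_sec: float,
                  efficiency: float = 1.0) -> float:
    """Days needed to push bytes_total over a link of the given bit rate.

    efficiency < 1.0 accounts for protocol overhead and link contention.
    """
    bytes_per_sec = link_bits_per_sec / 8 * efficiency
    return bytes_total / bytes_per_sec / 86_400  # seconds per day

petabyte = 10 ** 15
print(transfer_days(petabyte, 1e9))    # 1 Gbit/s ideal: ~92.6 days
print(transfer_days(petabyte, 10e9))   # 10 Gbit/s ideal: ~9.3 days
```

Even at 10 Gbit/s with perfect utilization you are looking at over a week of sustained transfer, which is the context for the Snowball suggestion below.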
If that is starting to sound like more data than you're willing to wait to upload, consider https://aws.amazon.com/snowball/. The gist of it is: AWS ships you a storage appliance, you copy your data onto it, and then you ship it back.
migrate the data to AWS FSx then use AWS Lambda to move to the bucket.
A petabyte is a lot of data, even to move from one AWS component to another. If S3 is the desired final location for the data, then it also makes an ideal ingress point. When it comes down to it, "migrate the data to AWS FSx" is a very similar operation to "upload the data to S3", except filesystems are involved, which you have to expose to the uploaders and secure. S3, on the other hand, will horizontally scale under the table, leaving your network as the bottleneck in that contest. Now if you want the data in a filesystem, by all means, consider FSx carefully. But if you want it in S3, put it there to begin with.
It's hard to offer a bunch of programming advice because this is a big question, but whatever you do, try to plan smart, because you don't want to realize you have to redo days, weeks, or months of data transfer. Make sure you have a way of restarting your transfer process without retransferring any more bytes than necessary, as transfer time is going to be the limiting factor. It's a task that might be implemented well with a worker pool and a work queue backed by durable storage. Start small, get your process down - whether or not you have to code anything yourself - and make sure you get it right the first time so you don't have to do it all over again.
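To make the restartability point concrete, here is one possible shape for such a driver, as a sketch rather than a definitive implementation: a thread pool works through the pending files while an append-only log of completed keys makes a restart skip everything already uploaded. The `upload` callable is a hypothetical stand-in (e.g. a wrapper around boto3), and `run_transfer` is my own name.

```python
import concurrent.futures
import pathlib

def run_transfer(files, upload, done_log: pathlib.Path, workers: int = 8) -> int:
    """Idempotent transfer driver: worker pool + durable completion log.

    `upload` is a caller-supplied callable taking one file key; the log
    records each key only after its upload succeeds, so a crash or
    restart retransfers at most the in-flight files. Returns the number
    of files attempted this run.
    """
    done = set(done_log.read_text().splitlines()) if done_log.exists() else set()
    pending = [f for f in files if f not in done]
    with done_log.open("a") as log, \
         concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(upload, f): f for f in pending}
        for fut in concurrent.futures.as_completed(futures):
            fut.result()                    # re-raise any worker error
            log.write(futures[fut] + "\n")  # mark complete only on success
            log.flush()                     # durable as soon as each file lands
    return len(pending)
```

Running it a second time with the same log does nothing, which is exactly the property you want before pointing it at a petabyte.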
Upvotes: 2