kenshin9

Exporting large datasets from RDS to S3 on a schedule?

To give some background, I'm looking to scale the creation of datasets for training an ML model. I'm dealing with large amounts of data: 200+ million records for the initial dataset. The incremental datasets we create for retraining will be smaller, maybe around 10 million records each, and I'm thinking they would be generated weekly.

The data is in RDS and I need to get it to S3. I don't need any transformations at the moment, so for now it's likely just a matter of running a query and saving the results to CSV. I imagine I can't run the entire query in one go, though, so I'll need to chunk it up. The data is organized by stores within regions, so I'm thinking I could query a set of stores per region, per month, and write each result to a separate file. The RDS table is partitioned by month and has the proper keys I need to run this query.
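To make the chunking idea concrete, here is a rough Python sketch of how I picture planning the work before any query runs. Everything in it is hypothetical: the `Chunk` class, the `plan_chunks` function, the `stores_per_chunk` size, and the S3 key layout are my own assumptions, not anything AWS prescribes.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from itertools import islice
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Chunk:
    """One unit of work: a batch of stores in one region for one month."""
    region: str
    month: date                  # first day of the month
    store_ids: tuple[int, ...]
    part: int                    # index of this store batch within region/month

    def s3_key(self, prefix: str = "exports") -> str:
        # Hypothetical key layout; one CSV file per chunk.
        return (f"{prefix}/region={self.region}/"
                f"month={self.month:%Y-%m}/part-{self.part:04d}.csv")


def batched(items: Iterable[int], size: int) -> Iterator[tuple[int, ...]]:
    """Yield consecutive tuples of at most `size` items."""
    it = iter(items)
    while batch := tuple(islice(it, size)):
        yield batch


def month_starts(start: date, end: date) -> Iterator[date]:
    """Yield the first day of every month from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield date(year, month, 1)
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


def plan_chunks(stores_by_region: dict[str, list[int]], start: date,
                end: date, stores_per_chunk: int = 50) -> Iterator[Chunk]:
    """Enumerate every (region, month, store batch) combination."""
    for region, store_ids in sorted(stores_by_region.items()):
        for month in month_starts(start, end):
            batches = batched(sorted(store_ids), stores_per_chunk)
            for part, batch in enumerate(batches):
                yield Chunk(region, month, batch, part)
```

Each `Chunk` would then map to one query (filtered by month and store IDs) and one output file, whichever service ends up running it.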

Now this is where I'm currently stuck and would like to hear from the community. I'm trying to figure out what service would be recommended for this sort of process.

The two services I see recommended most often are AWS Data Pipeline and AWS Glue. On the other hand, I've also read comments that make it sound like Data Pipeline is now an afterthought, or that it's no longer actively maintained. And from my understanding, I would still need to write scripts to do what I want anyway.

AWS Glue sounds like the better option of the two. But I wonder if it's overkill or not. Like I mentioned, I don't really need to do any transformations on the data, and it's only coming from one source, RDS. But if anything, I'm at least dealing with a large amount of data. So maybe it's not overkill?

I have also considered using Lambda functions, but I would have to create a few additional resources, such as SQS queues, to queue up the jobs so I could process them as separate events and avoid potential timeouts. I would also need a way to know when the files for all of the stores have been processed.
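For the "knowing when I'm done" part, one option I can think of is comparing the list of files I planned to write against what actually exists in the bucket. The Python sketch below shows only that comparison. The `export_status` function and the injected `list_existing` callable are hypothetical; in practice the listing would presumably come from S3 (for example via boto3's `list_objects_v2` paginator), which is an assumption on my part and is left out here.

```python
from typing import Callable, Iterable


def export_status(expected_keys: Iterable[str],
                  list_existing: Callable[[], Iterable[str]]) -> dict:
    """Compare the planned output keys against the keys that already exist.

    `list_existing` is injected so the check does not depend on any AWS
    client; it should return the object keys currently in the bucket.
    """
    expected = set(expected_keys)
    existing = set(list_existing())
    missing = sorted(expected - existing)
    return {
        "done": not missing,                    # True once nothing is missing
        "completed": len(expected & existing),  # planned files already written
        "total": len(expected),
        "missing": missing,                     # candidates to re-queue
    }
```

A scheduled check could call this after the queue drains and re-queue anything listed under `missing`, instead of each worker having to coordinate with the others.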

Does anyone have any advice or input on where I should start looking, and could you share your experience working with the service you recommend?
