Feres Gaaloul

Reputation: 89

How to run an ETL job with AWS Glue on a DynamoDB table so only new data lands as CSV files in S3

I'm trying to transfer product-rating data stored in a DynamoDB table to a CSV file that can be processed by a recommendation model deployed on AWS SageMaker.

I'm using AWS Glue to transform the data into a .csv file the ML model can use for training. The problem is that every time the job runs, the whole table gets transformed, creating duplicate data and slowing down processing.

I worked around the duplicate-data issue by deleting the old S3 objects before running the ETL job, but that feels like a temporary, hacky fix.
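For context, the cleanup step I'm running looks roughly like this (the bucket name and prefix are placeholders):

    import boto3

    s3 = boto3.resource("s3")
    # Delete everything under the output prefix so the next Glue run
    # doesn't produce duplicates -- this is the hacky part.
    bucket = s3.Bucket("ratings-etl-bucket")
    bucket.objects.filter(Prefix="ratings/").delete()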

What I want is to keep collecting new data in a DynamoDB table, and then, on a daily or weekly basis, have the ETL job pick up only the new data. If there was any new data during that period, it gets added to the S3 bucket and the model gets retrained.

Upvotes: 0

Views: 867

Answers (1)

Sasank Mukkamala

Reputation: 1574

If you are only concerned about new records and not worried about updates to old records, you can:

  • enable streams on the DynamoDB table,
  • have a Lambda function read them and append new records to a CSV file in s3bucket/new/date-file.csv (a sketch of such a handler follows this list),
  • after each ETL run, move the files to s3bucket/archive/date-file.csv.
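Here's a minimal sketch of such a handler, assuming a ratings table with user_id, product_id, and rating attributes and a hypothetical bucket named ratings-etl-bucket. Since S3 objects can't be appended to in place, it writes one CSV object per stream batch under new/:

    import csv
    import io
    from datetime import datetime, timezone

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical bucket and prefix -- substitute your own.
    BUCKET = "ratings-etl-bucket"
    NEW_PREFIX = "new/"

    def lambda_handler(event, context):
        # Collect only INSERT events from the stream batch; updates
        # and deletes are ignored, matching the assumption above.
        rows = []
        for record in event.get("Records", []):
            if record.get("eventName") != "INSERT":
                continue
            image = record["dynamodb"]["NewImage"]
            # Assumed attribute names for a product-ratings table.
            rows.append([image["user_id"]["S"],
                         image["product_id"]["S"],
                         image["rating"]["N"]])

        if not rows:
            return {"written": 0}

        buf = io.StringIO()
        csv.writer(buf).writerows(rows)

        # One object per invocation; S3 has no append, so each batch
        # becomes its own date-stamped file under new/.
        key = (f"{NEW_PREFIX}{datetime.now(timezone.utc):%Y-%m-%d}"
               f"-{context.aws_request_id}.csv")
        s3.put_object(Bucket=BUCKET, Key=key,
                      Body=buf.getvalue().encode("utf-8"))
        return {"written": len(rows), "key": key}

For the archive step, note there is no native move operation in S3: after the ETL runs, copy each object to the archive/ prefix (s3.copy_object) and then delete the original (s3.delete_object).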

Upvotes: 0
