Feres Gaaloul

Reputation: 89

How to run an ETL job with AWS Glue on a DynamoDB table so only new data lands as CSV files in S3

I'm trying to transfer product-rating data stored in a DynamoDB table to a CSV file that can be processed by a recommendation model deployed on AWS SageMaker.

I'm using AWS Glue to transform the data into a .csv file the ML model can use for training. The problem is that every time the job runs, the whole table gets transformed, creating duplicate data and slowing down processing.

I worked around the duplicate-data issue by deleting the old S3 objects before running the ETL job, but that feels like a temporary, hacky fix.
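For context, the cleanup step I'm running looks roughly like this (the bucket name and prefix are placeholders):

    import boto3

    s3 = boto3.resource("s3")
    # Delete everything under the output prefix so the next Glue run
    # doesn't produce duplicates -- this is the hacky part.
    bucket = s3.Bucket("ratings-etl-bucket")
    bucket.objects.filter(Prefix="ratings/").delete()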

What I want is to keep collecting new data in a DynamoDB table, and then, on a daily or weekly basis, have the ETL job pick up only the new data. If there was any new data during that period, it gets added to the S3 bucket and the model gets retrained.

Upvotes: 0

Views: 867

Answers (1)

Sasank Mukkamala

Reputation: 1574

If you are only concerned about new records and not worried about updates to old records, you can:

  • enable streams on the DynamoDB table,
  • have a Lambda function read them and append new records to a CSV file in s3bucket/new/date-file.csv (a sketch of such a handler follows this list),
  • after each ETL run, move the files to s3bucket/archive/date-file.csv.
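Here's a minimal sketch of such a handler, assuming a ratings table with user_id, product_id, and rating attributes and a hypothetical bucket named ratings-etl-bucket. Since S3 objects can't be appended to in place, it writes one CSV object per stream batch under new/:

    import csv
    import io
    from datetime import datetime, timezone

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical bucket and prefix -- substitute your own.
    BUCKET = "ratings-etl-bucket"
    NEW_PREFIX = "new/"

    def lambda_handler(event, context):
        # Collect only INSERT events from the stream batch; updates
        # and deletes are ignored, matching the assumption above.
        rows = []
        for record in event.get("Records", []):
            if record.get("eventName") != "INSERT":
                continue
            image = record["dynamodb"]["NewImage"]
            # Assumed attribute names for a product-ratings table.
            rows.append([image["user_id"]["S"],
                         image["product_id"]["S"],
                         image["rating"]["N"]])

        if not rows:
            return {"written": 0}

        buf = io.StringIO()
        csv.writer(buf).writerows(rows)

        # One object per invocation; S3 has no append, so each batch
        # becomes its own date-stamped file under new/.
        key = (f"{NEW_PREFIX}{datetime.now(timezone.utc):%Y-%m-%d}"
               f"-{context.aws_request_id}.csv")
        s3.put_object(Bucket=BUCKET, Key=key,
                      Body=buf.getvalue().encode("utf-8"))
        return {"written": len(rows), "key": key}

For the archive step, note there is no native move operation in S3: after the ETL runs, copy each object to the archive/ prefix (s3.copy_object) and then delete the original (s3.delete_object).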

Upvotes: 0
