Reputation: 89
I'm trying to transfer product-rating data stored in a DynamoDB table to a CSV file that can be processed by a recommendation model deployed on AWS SageMaker.
I'm using AWS Glue to transform the data into a .csv file the ML model can use for training. The problem is that every time the job runs, the whole table gets transformed, which creates duplicate data and slows down processing.
I worked around the duplicate-data issue by deleting the old S3 objects before running the ETL job, but that feels like a temporary, hacky fix.
What I want is to keep collecting new data in a DynamoDB table; then, on a daily or weekly basis, the ETL job picks up only the records added during that period, and if there were any, appends them to the S3 bucket and retrains the model.
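The "only export what's new" step can be sketched independently of Glue: filter items by a creation timestamp and serialize just those to CSV, skipping the upload and retraining entirely when nothing is new. This is a minimal sketch, assuming each item carries an ISO-8601 `created_at` attribute; the field names (`user_id`, `product_id`, `rating`) are hypothetical placeholders for your table's schema:

```python
import csv
import io
from datetime import datetime

def new_ratings_to_csv(items, last_run_iso):
    """Keep only items created after the last ETL run and serialize
    them to CSV. Returns None when there is nothing new, so the
    caller can skip the S3 upload and the retraining job."""
    cutoff = datetime.fromisoformat(last_run_iso)
    fresh = [it for it in items
             if datetime.fromisoformat(it["created_at"]) > cutoff]
    if not fresh:
        return None  # nothing new this period
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["user_id", "product_id", "rating", "created_at"])
    writer.writeheader()
    writer.writerows(fresh)
    return buf.getvalue()
```

In the real pipeline the `items` list would come from a DynamoDB query or scan and the result would be written to S3; the filtering logic itself stays the same.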
Upvotes: 0
Views: 867
Reputation: 1574
If you are only concerned about new records and not worried about updates to old records, you can track the timestamp of the last successful export and have each ETL run process only items created after it, instead of rescanning the whole table.
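A minimal sketch of that bookmark pattern, assuming items carry an ISO-8601 `created_at` string (ISO-8601 strings compare lexicographically in chronological order, so plain string comparison works); the local JSON bookmark file is hypothetical, and in a real pipeline the bookmark could live in S3 or an SSM parameter instead:

```python
import json
from pathlib import Path

# Hypothetical local state file recording the last successful export.
BOOKMARK_PATH = Path("etl_bookmark.json")

def load_last_run(default="1970-01-01T00:00:00"):
    """Return the timestamp of the last successful export,
    or the epoch default on the very first run."""
    if BOOKMARK_PATH.exists():
        return json.loads(BOOKMARK_PATH.read_text())["last_run"]
    return default

def select_new_items(items, last_run):
    """Keep only items strictly newer than the bookmark."""
    return [it for it in items if it["created_at"] > last_run]

def save_last_run(ts):
    """Persist the new bookmark after a successful export."""
    BOOKMARK_PATH.write_text(json.dumps({"last_run": ts}))
```

Each run would load the bookmark, select and export only the newer items, then save the newest `created_at` it saw as the next bookmark.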
Upvotes: 0