Reputation: 1841
JSON files are posted daily to an S3 bucket. I want to take that JSON file, do some processing on it, then post the data to a new S3 bucket where it will get picked up and stored in Redshift. What would be the recommended AWS pipeline for this? An AWS Lambda that triggers when a new JSON file lands in S3 and then kicks off something like an AWS Batch job? Or something else? I am not familiar with all the AWS services, so I might be overlooking something obvious.
So the flow looks like this:
s3 bucket -> data processing -> s3 bucket -> redshift
and it's the data processing step I'm not sure about: how to schedule something fairly scalable that runs daily, processes efficiently, and puts the data back. The processing is parsing the JSON data plus some aggregation and data clean-up.
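For the kind of processing described (parse, aggregate, clean up), a single Lambda function triggered by the source bucket is often enough. Below is a minimal sketch; the bucket name, the event wiring, and the `transform` logic are all assumptions for illustration, not a definitive design. The boto3 import is kept inside the handler so the pure transform can be exercised locally without AWS dependencies.

```python
import json
from collections import defaultdict

DEST_BUCKET = "processed-data-bucket"  # hypothetical destination bucket


def transform(records):
    """Placeholder clean-up/aggregation: drop records missing an 'id',
    then sum 'amount' per 'category'. Swap in the real logic here."""
    totals = defaultdict(float)
    for rec in records:
        if "id" not in rec:
            continue  # clean-up step: discard malformed records
        totals[rec.get("category", "unknown")] += rec.get("amount", 0)
    return [{"category": c, "total": t} for c, t in sorted(totals.items())]


def handler(event, context):
    # Assumes an s3:ObjectCreated:* trigger on the source bucket.
    import boto3  # imported here so transform() is testable without boto3

    s3 = boto3.client("s3")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        result = transform(json.loads(body))
        # Write newline-delimited JSON, a format Redshift's COPY can load.
        out = "\n".join(json.dumps(r) for r in result)
        s3.put_object(Bucket=DEST_BUCKET, Key=f"processed/{key}",
                      Body=out.encode("utf-8"))
```

From the destination bucket, a Redshift `COPY ... FORMAT AS JSON` statement (run on a schedule or via another trigger) completes the last hop of the pipeline.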
Upvotes: 0
Views: 204
Reputation: 65594
and it's the data processing step I'm not sure about - how to schedule something fairly scalable that runs daily and efficiently and puts the data back.
Don't worry about scalability with Lambda; just focus on short-running jobs. Here is an example: https://docs.aws.amazon.com/lambda/latest/dg/with-scheduledevents-example.html
I think one piece of the puzzle you're missing is the documentation for Schedule Expressions Using Rate or Cron: https://docs.aws.amazon.com/lambda/latest/dg/with-scheduledevents-example.html
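To make the two forms concrete, here is a small sketch of what the rate and cron variants look like, with a toy validator; the exact field grammar below is a simplified assumption based on the linked docs, not the full specification.

```python
import re

# Simplified shapes of the two schedule-expression forms (assumed from
# the docs): rate(value unit) or cron(six space-separated fields).
RATE_RE = re.compile(r"^rate\(\d+ (minute|minutes|hour|hours|day|days)\)$")
CRON_RE = re.compile(r"^cron\(([^ ]+ ){5}[^ ]+\)$")


def looks_like_schedule_expression(expr):
    """Rough sanity check that a string resembles a schedule expression."""
    return bool(RATE_RE.match(expr) or CRON_RE.match(expr))


# Run the processing job once a day:
DAILY_RATE = "rate(1 day)"
# Or at 06:00 UTC every day
# (cron fields: minute hour day-of-month month day-of-week year):
DAILY_CRON = "cron(0 6 * * ? *)"
```

Either string would be attached to a scheduled rule that invokes the Lambda daily.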
Upvotes: 2