Reputation: 554
I have all my data available in the S3 location s3://sample/input_data.
I do my ETL by deploying AWS EMR and using PySpark.
My PySpark script is very simple:

1. Read all the data in s3://sample/input_data as a Spark DataFrame.
2. Write it to s3://sample/output_data with partitions.
3. Copy the contents of s3://sample/input_data to s3://sample/archive_data and delete all data in s3://sample/input_data.

So when new data arrives in s3://sample/input_data, the script only processes the new files and saves them to s3://sample/output_data with partitions.
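In outline, the script does something like the sketch below (the transformations are omitted, the input format and partition column are assumptions, and the archive/cleanup step here uses boto3, which is just one way to do the copy and delete):

```python
# Minimal sketch of the current flow (transform logic omitted; the archive/cleanup
# step uses boto3, which is only one possible way to do the copy + delete).
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sample-etl").getOrCreate()

# 1. Read everything currently sitting in the input prefix (format is an assumption).
df = spark.read.parquet("s3://sample/input_data/")

# 2. Write the result to the output prefix, partitioned (partition column assumed).
df.write.mode("append").partitionBy("event_date").parquet("s3://sample/output_data/")

# 3. Archive the processed input files and empty the input prefix.
s3 = boto3.resource("s3")
bucket = s3.Bucket("sample")
for obj in bucket.objects.filter(Prefix="input_data/"):
    s3.Object("sample", obj.key.replace("input_data/", "archive_data/", 1)).copy_from(
        CopySource={"Bucket": "sample", "Key": obj.key}
    )
    obj.delete()
```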
Is there any built-in mechanism that AWS EMR provides, which I should be aware of, that I could use instead of doing that last step in my PySpark script?
Upvotes: 0
Views: 579
Reputation: 427
You can add a step to your EMR cluster. The JAR would be script-runner.jar: s3://<region>.elasticmapreduce/libs/script-runner/script-runner.jar, where <region> is the Region in which your Amazon EMR cluster resides. You can use script-runner.jar to run scripts saved locally or on Amazon S3 on your cluster.
You must specify the shell script to run; in your case, one that runs the cp command.
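For instance, adding such a step with boto3 could look roughly like this (the cluster ID, Region, and S3 path of the shell script are placeholders; the script itself would hold the aws s3 cp / rm commands that archive and clear the input prefix):

```python
# Sketch: add an EMR step that uses script-runner.jar to run a shell script in S3.
# Cluster ID, Region, and script path are placeholders, not real values.
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed Region

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # your EMR cluster ID
    Steps=[
        {
            "Name": "archive-input-data",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                # script-runner.jar lives in the Regional EMR bucket.
                "Jar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
                # Hypothetical shell script containing the aws s3 cp / rm commands
                # that move input_data to archive_data and empty input_data.
                "Args": ["s3://sample/scripts/archive_input.sh"],
            },
        }
    ],
)
```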
Upvotes: 0
Reputation: 6998
You could either use Delta Lake for those purposes or partition your input directory by a time interval like s3://sample/input_data/year=2021/month=11/day=11/
so that you only process data from that time interval.
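As a rough sketch of the second option, the job could read only the partition path for the interval it should process (Parquet format and the concrete date values are assumptions):

```python
# Sketch: process a single time interval by reading only its partition path.
# The file format and the year/month/day values are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-input").getOrCreate()

df = (
    spark.read
    .option("basePath", "s3://sample/input_data/")  # keeps year/month/day as columns
    .parquet("s3://sample/input_data/year=2021/month=11/day=11/")
)

df.write.mode("append").partitionBy("year", "month", "day").parquet("s3://sample/output_data/")
```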
Upvotes: 1