Jeeva Bharathi

Reputation: 554

How to perform incremental load using AWS EMR (PySpark) the right way?

I have all my data available in the S3 location s3://sample/input_data.

I do my ETL by deploying AWS EMR and using PySpark.

The PySpark script is very simple.

When new data arrives in s3://sample/input_data, it should process only the new files and save them to s3://sample/output_data with partitioning.
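
A minimal sketch of the kind of script I mean (the input format, partition column, and the file-moving last step are illustrative, not my exact code):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("incremental-load").getOrCreate()

    # Reads everything under input_data; JSON is just an illustrative format.
    df = spark.read.json("s3://sample/input_data/")

    # ...simple transformations...

    # Save with partitioning; "event_date" is an illustrative column name.
    (df.write
       .mode("append")
       .partitionBy("event_date")
       .parquet("s3://sample/output_data/"))

    # Last step: move the files that were just processed out of input_data
    # so the next run only sees new files (this is the step I'd like to
    # replace with something built in).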

Is there any built-in mechanism AWS EMR provides that I should be aware of, which I could use instead of doing the last step of my PySpark script?

Upvotes: 0

Views: 579

Answers (2)

Addy

Reputation: 427

You can use an EMR step for this. The jar would be script-runner.jar: s3://&lt;region&gt;.elasticmapreduce/libs/script-runner/script-runner.jar, where &lt;region&gt; is the Region in which your Amazon EMR cluster resides. You can use script-runner.jar to run scripts saved locally or on Amazon S3 on your cluster.

You must specify the shell script to run. In your case, one that runs a cp command.
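
A minimal sketch of adding such a step with boto3 (the cluster ID, Region, and script location are hypothetical):

    import boto3

    # Sketch: add an EMR step that runs a shell script via script-runner.jar.
    emr = boto3.client("emr", region_name="us-east-1")

    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",  # hypothetical cluster ID
        Steps=[{
            "Name": "copy-new-files",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                # Use the script-runner.jar for the Region your cluster runs in.
                "Jar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
                # Hypothetical shell script on S3 that copies the new files.
                "Args": ["s3://sample/scripts/copy_new_files.sh"],
            },
        }],
    )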

Upvotes: 0

Robert Kossendey

Reputation: 6998

You could either use Delta Lake for those purposes or partition your input directory by a time interval like s3://sample/input_data/year=2021/month=11/day=11/ so that you only process data from that time interval.
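
A minimal sketch of reading only one day's partition (the JSON format and exact path layout are assumptions):

    from datetime import date

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("daily-incremental").getOrCreate()

    # Build the path for today's partition so older files are never scanned.
    today = date.today()
    path = (f"s3://sample/input_data/"
            f"year={today.year}/month={today.month}/day={today.day}/")

    df = spark.read.json(path)  # only today's data

    # ...transform and write to s3://sample/output_data as before...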

Upvotes: 1
