alexanoid

Reputation: 25862

Apache Spark/AWS EMR and tracking of processed files

I have an AWS S3 folder where a large number of JSON files is stored. I need to ETL these files with AWS EMR over Spark and store the transformed data in AWS RDS.

I have implemented the Spark job for this purpose in Scala, and everything is working fine. I plan to execute this job once a week.
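For reference, a minimal sketch of what my batch job looks like (bucket paths, column names, and RDS connection details below are placeholders, not my real values):

    import org.apache.spark.sql.{SaveMode, SparkSession}

    object WeeklyJsonEtl {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("WeeklyJsonEtl")
          .getOrCreate()

        // Read every JSON file in the folder (placeholder bucket/path)
        val raw = spark.read.json("s3://my-bucket/incoming/")

        // Selecting columns stands in for the real transformation logic
        val transformed = raw.select("id", "payload")

        // Write the result to RDS over JDBC (placeholder endpoint and credentials)
        transformed.write
          .format("jdbc")
          .option("url", "jdbc:postgresql://my-rds-endpoint:5432/mydb")
          .option("dbtable", "etl_output")
          .option("user", "etl_user")
          .option("password", sys.env("RDS_PASSWORD"))
          .mode(SaveMode.Append)
          .save()

        spark.stop()
      }
    }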

From time to time, external logic can add new files to the AWS S3 folder, so the next time my Spark job starts I'd like to process only the new (unprocessed) JSON files.

Right now I don't know where to store information about the processed JSON files so that the Spark job can decide which files/folders to process. Could you please advise me on the best practice (and how) to track these changes with Spark/AWS?

Upvotes: 2

Views: 703

Answers (1)

N_C

Reputation: 992

If it is a Spark Streaming job, checkpointing is what you are looking for; it is discussed here.

  • Checkpointing stores the state information (i.e. offsets, etc.) in an HDFS/S3 location, so when the job is started again, Spark picks up only the unprocessed files. Checkpointing also offers better fault tolerance in case of failures, as the state is handled automatically by Spark itself.

  • Again, checkpointing only works when the Spark job runs in streaming mode; see the sketch after this list.
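As an illustration, here is a minimal sketch using the Structured Streaming file source with a checkpoint, so that each run picks up only files not yet processed. The paths, schema fields, and JDBC settings are hypothetical; Trigger.Once() makes the streaming query process whatever is new and then stop, which fits a weekly schedule:

    import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
    import org.apache.spark.sql.streaming.Trigger
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    object IncrementalJsonEtl {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("IncrementalJsonEtl")
          .getOrCreate()

        // Streaming file sources require an explicit schema (fields are hypothetical)
        val jsonSchema = StructType(Seq(
          StructField("id", StringType),
          StructField("payload", StringType)
        ))

        // The file source lists the folder and, together with the checkpoint,
        // remembers which files have already been processed
        val input = spark.readStream
          .schema(jsonSchema)
          .json("s3://my-bucket/incoming/")   // hypothetical input folder

        // foreachBatch lets a streaming query reuse the batch JDBC writer for RDS
        val query = input.writeStream
          .foreachBatch { (batch: DataFrame, batchId: Long) =>
            batch.write
              .format("jdbc")
              .option("url", "jdbc:postgresql://my-rds-endpoint:5432/mydb") // hypothetical
              .option("dbtable", "etl_output")
              .option("user", "etl_user")
              .option("password", sys.env("RDS_PASSWORD"))
              .mode(SaveMode.Append)
              .save()
          }
          // The checkpoint directory on S3 holds the processed-file log
          .option("checkpointLocation", "s3://my-bucket/checkpoints/json-etl/")
          // Process all new files since the last run, then stop
          .trigger(Trigger.Once())
          .start()

        query.awaitTermination()
      }
    }

On each weekly run, the query consults the checkpoint, ignores files it has seen before, and writes only the new records to RDS.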

Upvotes: 2
