Reputation: 25862
I have an AWS S3 folder where a large number of JSON files is stored. I need to ETL these files with Spark on AWS EMR and store the transformed result in AWS RDS.
I have implemented the Spark job for this purpose in Scala and everything is working fine. I plan to execute this job once a week.
From time to time, external logic can add new files to the AWS S3 folder, so the next time my Spark job starts I'd like to process only the new (unprocessed) JSON files.
Right now I don't know where to store the information about the processed JSON files so the Spark job can decide which files/folders to process. Could you please advise me what the best practice is (and how) to track these changes with Spark/AWS?
Upvotes: 2
Views: 703
Reputation: 992
If it is a Spark Streaming job, checkpointing is what you are looking for; it is discussed here.
Checkpointing stores the state information (i.e. offsets, etc.) in an HDFS/S3 bucket, so when the job is started again, Spark picks up only the unprocessed files. Checkpointing also offers better fault tolerance in case of failures, as the state is handled automatically by Spark itself.
Again, checkpointing only works in the streaming mode of a Spark job.
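To illustrate, here is a minimal sketch of how this could look with Spark Structured Streaming (rather than the older DStream API): the file source plus a checkpoint location makes Spark track which S3 files it has already read, and Trigger.Once lets you still run it on a weekly schedule like a batch job. The bucket names, RDS endpoint, schema fields, and table name below are placeholders, not from your setup.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object IncrementalJsonEtl {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("incremental-json-etl")
      .getOrCreate()

    // Streaming file sources require an explicit schema; placeholder fields here.
    val jsonSchema = StructType(Seq(
      StructField("id", StringType),
      StructField("payload", StringType)
    ))

    // Read the S3 prefix as a streaming file source. Spark records which files
    // it has already processed in the checkpoint location, so a rerun only
    // picks up files added since the last run.
    val input = spark.readStream
      .schema(jsonSchema)
      .json("s3://my-bucket/json-input/") // hypothetical input prefix

    val transformed = input // apply the existing transformation logic here

    // Write each micro-batch to RDS over JDBC.
    val writeToRds: (DataFrame, Long) => Unit = (batch, _) =>
      batch.write
        .format("jdbc")
        .option("url", "jdbc:mysql://my-rds-endpoint:3306/mydb") // hypothetical RDS endpoint
        .option("dbtable", "etl_output")
        .option("user", "etl_user")
        .option("password", sys.env.getOrElse("RDS_PASSWORD", ""))
        .mode("append")
        .save()

    // Trigger.Once processes whatever is new and then stops, so the job can
    // still be scheduled once a week.
    val query = transformed.writeStream
      .foreachBatch(writeToRds)
      .option("checkpointLocation", "s3://my-bucket/checkpoints/json-etl/") // hypothetical
      .trigger(Trigger.Once())
      .start()

    query.awaitTermination()
  }
}
```

The key point is the checkpointLocation option: Spark persists the list of already-seen input files (and other state) there, so you don't have to track processed files yourself.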
Upvotes: 2