Reputation: 1148
I am new to Airflow, and I feel like I may be missing some convention or concept.
Context: I have files being periodically dropped into an S3 bucket. My pipeline will need to grab new files and process them.
Basically: How do I avoid re-processing?
It is not unlikely that some part of the pipeline will change in the future and I will want to re-process files. But on a day-to-day basis I don't want to re-process files. Additionally there will likely be other pipelines in the future which would need to start from the beginning and process all the files for a different output.
I have plenty of scrappy ways of preserving state (a local JSON file, or checking the existence of output files), but I'm wondering if there's a convention in Airflow. What makes the most sense to me at the moment is to reuse the Postgres instance that already exists for Airflow (maybe bad form?), add another database, and start creating tables in there that track which input files have been processed for workflow X, workflow Y, etc.
How would you do this?
Upvotes: 5
Views: 1881
Reputation: 9507
Here is how I have solved a similar problem with a 4-task DAG.

Write a custom S3Sensor that extends BaseSensorOperator. This sensor uses the boto3 library and watches a specific folder in the bucket. If any files are put into that folder, it posts all the file paths to XCom. This sensor is the first operator in the DAG.
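A minimal sketch of what such a sensor could look like, assuming Airflow 2.x import paths; the bucket, prefix, and XCom key names here are placeholders I've made up, not part of the original setup:

```python
import boto3
from airflow.sensors.base import BaseSensorOperator


class NewFileS3Sensor(BaseSensorOperator):
    """Pokes an S3 prefix and pushes any keys it finds to XCom."""

    def __init__(self, bucket, prefix, **kwargs):
        super().__init__(**kwargs)
        self.bucket = bucket
        self.prefix = prefix

    def poke(self, context):
        s3 = boto3.client("s3")
        response = s3.list_objects_v2(Bucket=self.bucket, Prefix=self.prefix)
        keys = [obj["Key"] for obj in response.get("Contents", [])]
        if not keys:
            return False  # nothing there yet, keep poking
        # Post all the file paths to XCom for the downstream tasks.
        context["ti"].xcom_push(key="new_file_keys", value=keys)
        return True
```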
The next operator in the DAG is a PythonOperator that reads the list from the previous task's XCom. It moves all the files to another folder in the same bucket, again posting the new paths to XCom.
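For example, the move step could be a callable like the one below (a sketch, not the exact code: the bucket and prefix names are invented, and the task_id and XCom key match the hypothetical sensor above):

```python
import boto3


def move_new_files(**context):
    """Move the files found by the sensor to a separate prefix, then
    hand the new paths to the processing task via XCom."""
    s3 = boto3.client("s3")
    keys = context["ti"].xcom_pull(task_ids="watch_for_new_files", key="new_file_keys")
    moved = []
    for key in keys or []:
        new_key = key.replace("incoming/", "processing/", 1)
        # S3 has no rename: copy to the new prefix, then delete the original.
        s3.copy_object(
            Bucket="my-bucket",
            CopySource={"Bucket": "my-bucket", "Key": key},
            Key=new_key,
        )
        s3.delete_object(Bucket="my-bucket", Key=key)
        moved.append(new_key)
    return moved  # the return value is pushed to XCom automatically
```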
The next operator processes each of these files.
The last operator triggers this same DAG again (so we start back at the custom S3 file sensor, because the DAG retriggers itself).
The DAG must not have a schedule_interval and needs to be triggered once manually. It will then watch the bucket forever, or until something breaks.
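Putting it together, the wiring might look roughly like this (again Airflow 2.x imports; NewFileS3Sensor and move_new_files are the hypothetical pieces sketched above, and process_moved_files is a stand-in for whatever your processing step does):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="s3_file_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # no schedule: trigger once manually, then it retriggers itself
    catchup=False,
) as dag:
    watch = NewFileS3Sensor(
        task_id="watch_for_new_files",
        bucket="my-bucket",
        prefix="incoming/",
    )
    move = PythonOperator(task_id="move_files", python_callable=move_new_files)
    process = PythonOperator(task_id="process_files", python_callable=process_moved_files)
    retrigger = TriggerDagRunOperator(
        task_id="retrigger_self",
        trigger_dag_id="s3_file_pipeline",  # points back at this same DAG
    )

    watch >> move >> process >> retrigger
```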
Upvotes: 3