FelixC

Reputation: 53

Airflow kubernetes pod operator and sharing files between tasks?

I have 3 container images that run my workload.

(each of these expects the files to be in its own file system)

  1. Container 1 generates file_1
  2. Container 2 consumes file_1 and generates file_2
  3. Container 3 consumes file_1 and file_2 and generates file_3

So the Airflow tasks would be:

container 1 >> container 2 >> container 3

I want to use the KubernetesPodOperator in Airflow to take advantage of the auto-scaling options for Airflow running on Kubernetes. But since the KubernetesPodOperator creates one pod per task, and each of these steps is its own task, how can I pass these files around?

I can modify the source code in each container to be aware of an intermediate location like S3 to upload files, but is there a built-in Airflow way of doing this without modifying the workers' source code?
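For reference, here is roughly the DAG I have in mind (a sketch only; the image names and task ids are placeholders, and the import path depends on the Airflow / cncf.kubernetes provider version):

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
        KubernetesPodOperator,
    )

    with DAG(
        dag_id="three_step_pipeline",
        start_date=datetime(2021, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        container_1 = KubernetesPodOperator(
            task_id="container_1",
            name="container-1",
            namespace="default",
            image="my-registry/container-1:latest",  # generates file_1
        )
        container_2 = KubernetesPodOperator(
            task_id="container_2",
            name="container-2",
            namespace="default",
            image="my-registry/container-2:latest",  # consumes file_1, generates file_2
        )
        container_3 = KubernetesPodOperator(
            task_id="container_3",
            name="container-3",
            namespace="default",
            image="my-registry/container-3:latest",  # consumes file_1 and file_2, generates file_3
        )

        # Each task runs in its own pod with its own file system, which is
        # exactly why file_1 and file_2 are not visible downstream by default.
        container_1 >> container_2 >> container_3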

Upvotes: 3

Views: 2455

Answers (2)

Harsh Manvar

Reputation: 30083

You can use the Amazon S3 operators in Airflow: https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/operators/s3.html

Or you can write custom boto3 code; however, if you are not looking to write code, you can use NFS or EFS services.
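If you do go the custom boto3 route, the staging logic inside each container would look roughly like this (a sketch only; the bucket and key names are placeholders, and credentials are assumed to come from the environment or an IAM role):

    import boto3

    s3 = boto3.client("s3")

    # Container 1: after generating file_1 locally, upload it.
    s3.upload_file(Filename="/tmp/file_1", Bucket="my-staging-bucket", Key="run-123/file_1")

    # Container 2: download file_1 before processing, then upload file_2.
    s3.download_file(Bucket="my-staging-bucket", Key="run-123/file_1", Filename="/tmp/file_1")
    s3.upload_file(Filename="/tmp/file_2", Bucket="my-staging-bucket", Key="run-123/file_2")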

Read more about ReadWriteMany persistent volume claims here: https://medium.com/asl19-developers/create-readwritemany-persistentvolumeclaims-on-your-kubernetes-cluster-3a8db51f98e3

Since you want to scale, in this case you have to use the RWX (ReadWriteMany) access mode.

You can also check out different storage services like MinIO, GlusterFS, etc., which can provide a PVC with the ReadWriteMany option.

Files will be persisted to the PVC disk managed by NFS (or by EFS if you are on AWS), and all pods can access those files.
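With a ReadWriteMany PVC in place, each KubernetesPodOperator task can mount it so all three containers see the same directory. A rough sketch (the PVC name, namespace, and image are placeholders; the exact volumes / volume_mounts API depends on your Airflow and provider versions):

    from kubernetes.client import models as k8s

    from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
        KubernetesPodOperator,
    )

    # Assumes a PVC named "shared-data-pvc" with ReadWriteMany access already exists.
    shared_volume = k8s.V1Volume(
        name="shared-data",
        persistent_volume_claim=k8s.V1PersistentVolumeClaimVolumeSource(
            claim_name="shared-data-pvc"
        ),
    )
    shared_mount = k8s.V1VolumeMount(name="shared-data", mount_path="/data")

    container_1 = KubernetesPodOperator(
        task_id="container_1",
        name="container-1",
        namespace="default",
        image="my-registry/container-1:latest",
        volumes=[shared_volume],
        volume_mounts=[shared_mount],
    )
    # container_2 and container_3 get the same volumes / volume_mounts, so any
    # file written under /data by one task is readable by the next.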

If you are on GCP GKE, feel free to review my other answer: How to create a dynamic persistent volume claim with ReadWriteMany access in GKE?

Upvotes: 0

halil

Reputation: 1812

Airflow does not pass files between tasks. There is XCom, but it is not meant for files; it is for passing small pieces of data between tasks.
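For completeness, this is roughly what the XCom route looks like with the KubernetesPodOperator: with do_xcom_push=True, the main container writes a small JSON payload (a path or key, not the file itself) to /airflow/xcom/return.json, and a downstream task pulls it via a template. A sketch only; image names and values are placeholders:

    from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
        KubernetesPodOperator,
    )

    produce = KubernetesPodOperator(
        task_id="produce_metadata",
        name="produce-metadata",
        namespace="default",
        image="my-registry/container-1:latest",
        cmds=["bash", "-cx"],
        arguments=[
            "mkdir -p /airflow/xcom && "
            "echo '{\"file_1_key\": \"run-123/file_1\"}' > /airflow/xcom/return.json"
        ],
        do_xcom_push=True,
    )

    consume = KubernetesPodOperator(
        task_id="consume_metadata",
        name="consume-metadata",
        namespace="default",
        image="my-registry/container-2:latest",
        cmds=["bash", "-cx"],
        # 'arguments' is a templated field, so the pushed value can be pulled here.
        arguments=["echo {{ ti.xcom_pull(task_ids='produce_metadata')['file_1_key'] }}"],
    )

    produce >> consume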

I would suggest S3, as you already mentioned. Another alternative is to use Kubernetes-native features: you can mount the same persistent disk volume into all 3 containers, and they can read and write files in their local file systems, which are actually backed by a shared file system at the cluster level. But this is a more complex setup than just using S3, so I would only do it if an S3-like system is not an option for your setup.

Upvotes: 0
