Fact

Reputation: 2440

Running AIRFLOW in aws lambda

I am wondering if it's possible to run Airflow inside AWS Lambda. I am trying to build a serverless ETL pipeline using Airflow, and I am not very keen on using Docker for this. Any guidance would be appreciated.

Upvotes: 2

Views: 5417

Answers (2)

dlamblin

Reputation: 45321

It's probably not a good idea, but in a proof of concept only, it might be doable.

The standard Airflow deployment has one or more web servers running. With my few thousand DAG files, web server start-up takes almost 20 minutes, though this has been improved in 1.10.2 compared to the 1.8 and 1.10 versions I'm using.

It also has one scheduler, which is basically always running.

Finally, if you use the Celery executor, you want worker nodes running to pick up tasks. OTOH if you use the Kubernetes executor, the scheduler creates worker pods for the queued work (I think). Either way, these are also supposed to be always running.

Now, in AWS you could make a zip with all of Airflow's dependencies, a config file, and maybe a shim script to grab the latest DAG files from S3. The scheduler has a loop-limit argument, so you could set it to a single loop (or, with very few DAG files, why not 50 loops; it's usually under a second per file) instead of running forever. Then you could use some external trigger to run it regularly. Say you know you only schedule DAGs around the 10-minute mark, and your tasks usually take about 7-9 minutes; then a trigger every 10 minutes to run that scheduler might just work. Using Celery with SQS, you could probably kick off worker tasks as AWS Lambda functions whenever something is in the queue. Or, with Kubernetes, you would leave the EKS cluster up and let the scheduler push work to it.
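The "scheduler with a loop limit, triggered externally" idea could be sketched roughly like this. This is a hypothetical Lambda handler, not a tested deployment: the S3 DAG sync is omitted, and the `-n`/`--num_runs` flag is the Airflow 1.x scheduler CLI option that limits how many scheduling loops run before exiting.

```python
import subprocess


def build_scheduler_cmd(num_runs=1):
    # Limit the scheduler to a fixed number of loops instead of running
    # forever; "-n" is the Airflow 1.x scheduler's num_runs CLI flag.
    return ["airflow", "scheduler", "-n", str(num_runs)]


def handler(event, context):
    # Hypothetical Lambda entry point: you would first sync the latest DAG
    # files from S3 into the DAGs folder (omitted here), then run one
    # bounded scheduler pass and return its exit code.
    completed = subprocess.run(build_scheduler_cmd(), capture_output=True)
    return {"returncode": completed.returncode}
```

A CloudWatch Events / EventBridge schedule rule would then invoke `handler` every 10 minutes, per the timing assumption above.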

The tricky part then ends up being the web server. While it's true that you could probably run it on EC2, or as an ECS or EKS container, and start and stop it only when you want it, it uses quite a bit of resources to build the DagBag, like the scheduler does, and it only starts serving requests after doing that, so it's not well suited to AWS Lambda at all. I mean… if you totally rebuilt the UI so that most of it is static files in S3 and only some requests trigger a Lambda to fetch data from the DB… yes, that would work. But you'd be running a heavily customized Airflow.

At which point you have to wonder: if I have this much to develop in AWS Lambda to support it, how much more work would it be to build the whole DAG flow I need with RDS and Lambda but without Airflow?

Upvotes: 0

SergiyKolesnikov

Reputation: 7815

I think it is not possible. Even if you manage to deploy Airflow and all its required dependencies as a Lambda function, the service has hard limits that cannot be changed and that will prevent Airflow from running as a service. For example, the maximum run time of a Lambda function is 15 minutes, while the Airflow scheduler has to run continuously.

Using AWS services, you can get approximately the same functionality as with Airflow: Glue for writing ETL jobs and Step Functions to orchestrate them.
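As a rough illustration of that combination, a minimal Step Functions state machine can run a Glue job and wait for it to finish using the built-in `glue:startJobRun.sync` service integration (the job name here is a placeholder, not a real job):

```json
{
  "Comment": "Hypothetical ETL flow: run one Glue job and wait for completion",
  "StartAt": "RunEtlJob",
  "States": {
    "RunEtlJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": { "JobName": "my-etl-job" },
      "End": true
    }
  }
}
```

Scheduling (Airflow's other main role) is then covered by an EventBridge rule that starts the state machine execution.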

Upvotes: 4
