coderboi

Reputation: 341

What is task parallelism in the context of airflow?

Task parallelism in general is when multiple tasks run on the same or different sets of data. But what does it mean in the context of Airflow, when I change the parallelism parameter in the airflow.cfg file?

For instance, say I want to run a data processor on a batch of data. Will setting parallelism to 32 split the data into 32 sub-batches and run the same task on each sub-batch?

Or, if I somehow have 32 batches of data originally instead of 1, will I be able to run the data processor on all 32 batches at once (i.e. 32 task runs at the same time)?

Upvotes: 0

Views: 180

Answers (1)

Elad Kalif

Reputation: 15931

The setting won't "split the data" within your DAG. From the docs:

parallelism: This variable controls the number of task instances that runs simultaneously across the whole Airflow cluster
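For reference, this setting lives in the [core] section of airflow.cfg; a sketch of the relevant fragment (32 is a commonly used value, not a recommendation):

```ini
[core]
# Maximum number of task instances that can run concurrently
# across the entire Airflow cluster, regardless of which DAG
# they belong to. It does not split any task's data.
parallelism = 32
```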

If you want parallel execution of a task you will need to break it down further, meaning create more tasks where each task does less work. That can come in handy for some ETLs.

For example:

Let's say you want to copy yesterday's records from MySQL to S3.

You could do it with a single MySQLToS3Operator that reads yesterday's data in a single query. However, you could also break it into 2 MySQLToS3Operator tasks each reading 12 hours of data, or 24 operators each reading hourly data. That is up to you and the limitations of the services you are working with.
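Illustrative only: the hourly split described above boils down to computing 24 (start, end) windows and using each pair to parameterize one operator's query. A minimal sketch in plain Python (hourly_windows is a hypothetical helper, not an Airflow API):

```python
from datetime import datetime, timedelta

def hourly_windows(day_start):
    """Split a 24-hour day starting at day_start into 24
    (window_start, window_end) pairs, one per would-be task."""
    return [
        (day_start + timedelta(hours=h), day_start + timedelta(hours=h + 1))
        for h in range(24)
    ]

# Each window's bounds could feed the WHERE clause of one
# MySQLToS3Operator task; with parallelism >= 24 (and sufficient
# DAG/pool concurrency) all 24 tasks may run at the same time.
windows = hourly_windows(datetime(2022, 1, 1))
```

With fewer, larger windows (e.g. two 12-hour tasks) you trade scheduling overhead against per-query load on MySQL.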

Upvotes: 1
