Reputation: 341
Task parallelism in general means multiple tasks running on the same or different sets of data. But what does it mean in the context of Airflow when I change the parallelism parameter in the airflow.cfg file?
For instance, say I want to run a data processor on a batch of data. Will setting parallelism to 32 split the data into 32 sub-batches and run the same task on those sub-batches?
Or, if I somehow have 32 batches of data originally instead of 1, would I be able to run the data processor on all 32 batches (i.e., 32 task runs at the same time)?
Upvotes: 0
Views: 180
Reputation: 15931
The setting won't "split the data" within your DAG. From the docs:
parallelism: This variable controls the number of task instances that runs simultaneously across the whole Airflow cluster
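For reference, this is a single setting in the [core] section of airflow.cfg; the value below just mirrors the 32 from the question, not a recommendation:

```
[core]
# Maximum number of task instances that can run at once
# across the entire Airflow installation
parallelism = 32
```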
If you want parallel execution of a task you will need to break it up further, meaning create more tasks where each task does less work. That can come in handy for some ETLs.
For example:
Let's say you want to copy yesterday's records from MySQL to S3. You could do it with a single MySQLToS3Operator that reads yesterday's data in a single query. However, you can also break it into 2 MySQLToS3Operator tasks, each reading 12 hours of data, or 24 operators, each reading one hour of data (see the sketch below). That is up to you and the limitations of the services you are working with.
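A minimal sketch of the hourly variant, assuming the Amazon provider's MySQLToS3Operator; the import path and operator parameters vary by provider version, and the table, bucket, and connection IDs here are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.transfers.mysql_to_s3 import MySQLToS3Operator

with DAG(
    dag_id="mysql_to_s3_hourly",       # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # 24 independent tasks, one per hour of yesterday's data. How many
    # actually run at once is capped by parallelism (and pool/DAG limits).
    for hour in range(24):
        MySQLToS3Operator(
            task_id=f"copy_hour_{hour:02d}",
            # {{ ds }} is Airflow's templated logical date (YYYY-MM-DD)
            query=(
                "SELECT * FROM events "
                "WHERE DATE(created_at) = '{{ ds }}' "
                f"AND HOUR(created_at) = {hour}"
            ),
            s3_bucket="my-bucket",                            # hypothetical
            s3_key=f"events/{{{{ ds }}}}/hour_{hour:02d}.csv",
            mysql_conn_id="mysql_default",
            aws_conn_id="aws_default",
        )
```

Each loop iteration creates a separate task instance, so the scheduler is free to run up to parallelism of them concurrently instead of waiting on one big query.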
Upvotes: 1