romanzdk

Reputation: 1332

Azure Data Factory - large dependency pipeline

I have a "master" pipeline in Azure Data factory, which looks like this:

[image: diagram of the master pipeline]

Each rectangle is an Execute Pipeline activity for one destination (target) table, so each "child" pipeline takes some data, transforms it and saves it as the specified table. Essentially this means that before filling a table on the right, we have to fill the previous tables (the ones connected to it by a line).

The problem is that this master pipeline contains more than 100 activities, and the limit for a Data Factory pipeline is 40 activities.

I was thinking about dividing the pipeline into several smaller pipelines (i.e. the first layer (the 3 rectangles on the left), then the second layer, etc.). However, this could make the run a lot longer, as each layer could contain some large table that the next layer would have to wait for.

How to approach this? What is the best practice here?

Upvotes: 0

Views: 831

Answers (2)

virtualdvid

Reputation: 2421

I had a similar issue at work, but I didn't use Execute Pipeline because it was a terrible approach in my case. I have more than 800 PLs (pipelines) to run, with parent and child dependencies that can go multiple levels deep depending on the complexity of the data, plus several restrictions (starting with transforming data for 9 regions in the US while reusing PLs). A simplified diagram of one of the many cases I have can easily look like this:

[image: PL dependency tree]

The solution:

A master dependency table that stores all the dependencies:

| Job ID | dependency ID | level |   PL_name    |
|--------|---------------|-------|--------------|
| Token1 |               |   0   |              |
| L1Job1 | Token1        |   1   | my_PL_name_1 |
| L1Job2 | Token1        |   1   | my_PL_name_2 |
| L2Job1 | L1Job1,L1Job2 |   2   | my_PL_name_3 |
|  ...   |      ...      |  ...  |     ...      |
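
A minimal T-SQL sketch of what such a dependency table could look like (the table and column names here are illustrative, not taken from the answer):

    -- Hypothetical dependency table; names and types are assumptions.
    CREATE TABLE dbo.pl_dependency (
        job_id        VARCHAR(50)   NOT NULL PRIMARY KEY,  -- unique ID of the job (one per target PL)
        dependency_id VARCHAR(4000) NULL,                   -- comma-separated list of parent job IDs
        level         INT           NOT NULL,               -- depth in the dependency tree (0 = root token)
        pl_name       VARCHAR(200)  NULL                    -- name of the ADF pipeline to trigger
    );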

From here it is a tree problem:

There are ways of mapping trees in SQL. Once you have all the dependencies mapped from the tree, put them in a stage or tracker table (a sketch of one way to do this follows the table below):

| Job ID | dependency ID | level |  status   | start_date | end_date |
|--------|---------------|-------|-----------|------------|----------|
| Token1 |               |   0   |           |            |          |
| L1Job1 | Token1        |   1   | Running   |            |          |
| L1Job2 | Token1        |   1   | Succeeded |            |          |
| L2Job1 | L1Job1,L1Job2 |   2   |           |            |          |
|  ...   |      ...      |  ...  |    ...    |    ...     |   ...    |
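
For the tree mapping itself, a recursive CTE is one option. The sketch below assumes the parent/child pairs have been normalised into a hypothetical dbo.pl_dependency_edge table (one row per job/parent pair) and that the tracker table is called dbo.pl_tracker; both names are assumptions, not the schema from my actual solution:

    -- Walk the tree from the root token(s), compute each job's level,
    -- then seed the tracker table for a new batch.
    WITH tree AS (
        SELECT job_id,
               CAST(NULL AS VARCHAR(50)) AS parent_job_id,
               0 AS lvl
        FROM   dbo.pl_dependency_edge
        WHERE  parent_job_id IS NULL                    -- root token(s), level 0
        UNION ALL
        SELECT e.job_id, e.parent_job_id, t.lvl + 1
        FROM   dbo.pl_dependency_edge e
        JOIN   tree t ON e.parent_job_id = t.job_id
    )
    INSERT INTO dbo.pl_tracker (job_id, dependency_id, level)
    SELECT job_id,
           STRING_AGG(parent_job_id, ','),              -- rebuild the comma-separated parent list
           MAX(lvl)                                     -- a job's level = its longest path from the root
    FROM   tree
    GROUP BY job_id;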

We can easily query this table with a Lookup activity to get the level-1 PLs to run, and use a ForEach activity to trigger each target PL with a dynamic Web Activity. Then update the tracker table's status, start_date, end_date, etc. accordingly per PL.
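
As an illustration, the Lookup could run something like the first query below, and each child PL (or the orchestrator) could call a small stored procedure like the second one to record progress. Both are sketches against the hypothetical tables above; the actual trigger happens in the Web Activity, not in SQL:

    -- Level-1 PLs that have not been started yet
    -- (the result set feeds the ForEach / Web Activity).
    SELECT t.job_id, d.pl_name
    FROM   dbo.pl_tracker t
    JOIN   dbo.pl_dependency d ON d.job_id = t.job_id
    WHERE  t.level = 1
      AND  t.status IS NULL;
    GO

    -- Hypothetical helper to update the tracker; @job_id and @status
    -- would be passed in as pipeline parameters.
    CREATE PROCEDURE dbo.update_pl_status
        @job_id VARCHAR(50),
        @status VARCHAR(20)
    AS
    BEGIN
        UPDATE dbo.pl_tracker
        SET    status     = @status,
               start_date = CASE WHEN @status = 'Running' THEN GETUTCDATE() ELSE start_date END,
               end_date   = CASE WHEN @status IN ('Succeeded', 'Failed') THEN GETUTCDATE() ELSE end_date END
        WHERE  job_id = @job_id;
    END;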

There are only two orchestrating PLs:

  • one for mapping the tree and assigning some type of unique ID to that batch
  • one for validation (it verifies the status of the parent PLs and controls which PL runs next)

Note: both call a stored procedure with some logic depending on the case.

I have a recursive call to the validation PL each time a target pipeline ends:

Let's assume L1Job1 and L1Job2 are running in parallel:

  1. L1Job1 ends successfully -> calls the validation PL -> validation triggers L2Job1 only if both L1Job1 and L1Job2 have a Succeeded status.

  2. If L1Job2 hasn't ended yet, the validation PL ends without triggering L2Job1.

  3. Then L1Job2 ends successfully -> calls the validation PL -> validation triggers L2Job1 only if both L1Job1 and L1Job2 have a Succeeded status.

  4. L2Job1 starts running after passing the validations.

  5. Repeat for each level.

This works because we already mapped all the PL dependencies in the job tracker and we know exactly which PLs should run.
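
A sketch of the kind of check the validation PL's stored procedure could make, using the comma-separated dependency_id column from the hypothetical tables above (a normalised parent/child table would make this cleaner, but the idea is the same):

    -- Jobs that have not started yet and whose listed parents have all succeeded;
    -- the validation PL loops over this result set and triggers each pipeline.
    SELECT t.job_id, d.pl_name
    FROM   dbo.pl_tracker t
    JOIN   dbo.pl_dependency d ON d.job_id = t.job_id
    WHERE  t.level >= 1
      AND  t.status IS NULL
      AND  NOT EXISTS (
             SELECT 1
             FROM   dbo.pl_tracker p
             WHERE  ',' + t.dependency_id + ',' LIKE '%,' + p.job_id + ',%'  -- p is listed as a parent of t
               AND  ISNULL(p.status, '') <> 'Succeeded'
           );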

I know this looks complicated and maybe doesn't apply to your case, but I hope it gives you or others a clue about how to solve complex data workflows in Azure Data Factory.

Upvotes: 1

IpsitaDash-MT

Reputation: 1450

Yes, as per the documentation, the maximum number of activities per pipeline (which includes inner activities for containers) is 40.

So, the only option left is splitting your pipeline into multiple smaller pipelines.

Please check the link below for the limits on ADF:

https://github.com/MicrosoftDocs/azure-docs/blob/master/includes/azure-data-factory-limits.md

Upvotes: 0
