ac123

Reputation: 15

Multi-step incremental load and processing using Azure Data Factory

I want to achieve an incremental load/processing with Azure Data Factory and store the data in different places after each processing step, e.g.:

External data source (data is structured) -> ADLS (Raw) -> ADLS (Processed) -> SQL DB

Hence, I will need to extract a sample of the raw data from the source based on the current date, store it in an ADLS container, then process that same sample data, store the result in another ADLS container, and finally append the processed result to a SQL DB.

ADLS raw:

2022-03-01.txt

2022-03-02.txt

ADLS processed:

2022-03-01-processed.txt

2022-03-02-processed.txt

SQL DB:

All the .txt files in the ADLS processed container will be appended and stored in the SQL DB.

Hence, I would like to check: what would be the best way to achieve this in a single pipeline that is run in batches?

Upvotes: 1

Views: 976

Answers (2)

Amir parkar

Reputation: 60

You can't rename multiple files at once, so you have to copy the files one after the other.

  • Create a pipeline with a tumbling window trigger, and create two parameters named WindowStartTime and WindowEndTime on both the trigger and the pipeline.
  • Add a Get Metadata activity, set its "filter by last modified" options, and pass in WindowStartTime and WindowEndTime to get the list of files that landed between those two times.
  • Add a ForEach activity and pass it the file list returned by Get Metadata.
  • Inside the ForEach activity, add a Copy activity and pass it the file name from the loop.
  • In the sink dataset, take that file name and append "-processed.txt" (matching the naming in the question); see the JSON sketch after this list.
  • After the ForEach activity, add another Copy activity whose source is the processed layer, again passing WindowStartTime and WindowEndTime.
  • This Copy activity will read the files received in the current window and append them to the SQL DB.
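
Assuming ADLS Gen2 datasets with a parameterized file name, a rough JSON sketch of the Get Metadata / ForEach / Copy part of the pipeline could look something like the following (all activity and dataset names are placeholders, and non-essential properties are left out):

    {
      "name": "IncrementalLoadPipeline",
      "properties": {
        "parameters": {
          "WindowStartTime": { "type": "string" },
          "WindowEndTime": { "type": "string" }
        },
        "activities": [
          {
            "name": "GetNewRawFiles",
            "type": "GetMetadata",
            "typeProperties": {
              "dataset": { "referenceName": "AdlsRawFolder", "type": "DatasetReference" },
              "fieldList": [ "childItems" ],
              "storeSettings": {
                "type": "AzureBlobFSReadSettings",
                "modifiedDatetimeStart": { "value": "@pipeline().parameters.WindowStartTime", "type": "Expression" },
                "modifiedDatetimeEnd": { "value": "@pipeline().parameters.WindowEndTime", "type": "Expression" }
              }
            }
          },
          {
            "name": "ForEachNewFile",
            "type": "ForEach",
            "dependsOn": [ { "activity": "GetNewRawFiles", "dependencyConditions": [ "Succeeded" ] } ],
            "typeProperties": {
              "items": { "value": "@activity('GetNewRawFiles').output.childItems", "type": "Expression" },
              "activities": [
                {
                  "name": "CopyRawToProcessed",
                  "type": "Copy",
                  "inputs": [ {
                    "referenceName": "AdlsRawFile",
                    "type": "DatasetReference",
                    "parameters": { "fileName": { "value": "@item().name", "type": "Expression" } }
                  } ],
                  "outputs": [ {
                    "referenceName": "AdlsProcessedFile",
                    "type": "DatasetReference",
                    "parameters": {
                      "fileName": {
                        "value": "@concat(replace(item().name, '.txt', ''), '-processed.txt')",
                        "type": "Expression"
                      }
                    }
                  } ],
                  "typeProperties": {
                    "source": { "type": "DelimitedTextSource" },
                    "sink": { "type": "DelimitedTextSink" }
                  }
                }
              ]
            }
          }
        ]
      }
    }

On the tumbling window trigger, map the two pipeline parameters to @trigger().outputs.windowStartTime and @trigger().outputs.windowEndTime so that each batch run only picks up the files that landed in its window; the final Copy activity to the SQL DB can reuse the same two parameters as a last-modified filter on the processed container.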

Upvotes: 1

Pratik Somaiya

Reputation: 733

You can achieve this using a dynamic pipeline as follows:

  1. Create a Config / Metadata table in the SQL DB in which you place details such as the source table name, source name, etc.

  2. Create a pipeline as follows:

    a) Add a Lookup activity in which you create a query based on your Config table: https://learn.microsoft.com/en-us/azure/data-factory/control-flow-lookup-activity

    b) Add a ForEach activity and use the Lookup output as the input to the ForEach: https://learn.microsoft.com/en-us/azure/data-factory/control-flow-for-each-activity

    c) Inside the ForEach, add a Switch activity in which each Switch case distinguishes a table or source

    d) In each case, add a Copy activity (or whatever other activities you need) to create the file in the RAW layer

    e) Add another ForEach to your pipeline for the Processed layer, with inner activities similar to those for the RAW layer; this is where you add your processing logic

This way you can create a single, dynamic pipeline that performs the necessary operations for all sources; a skeleton of the Lookup / ForEach / Switch wiring is sketched below.
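
As a rough illustration only, the skeleton could look like the JSON below; the config table name, its columns, and all activity and dataset names are invented for the example, and each Switch case would hold the Copy and processing activities for its particular source (left empty here):

    {
      "name": "MetadataDrivenPipeline",
      "properties": {
        "activities": [
          {
            "name": "LookupConfig",
            "type": "Lookup",
            "typeProperties": {
              "source": {
                "type": "AzureSqlSource",
                "sqlReaderQuery": "SELECT SourceName, SourceTable, RawFolder FROM dbo.PipelineConfig WHERE IsActive = 1"
              },
              "dataset": { "referenceName": "ConfigSqlDataset", "type": "DatasetReference" },
              "firstRowOnly": false
            }
          },
          {
            "name": "ForEachSource",
            "type": "ForEach",
            "dependsOn": [ { "activity": "LookupConfig", "dependencyConditions": [ "Succeeded" ] } ],
            "typeProperties": {
              "items": { "value": "@activity('LookupConfig').output.value", "type": "Expression" },
              "activities": [
                {
                  "name": "SwitchOnSource",
                  "type": "Switch",
                  "typeProperties": {
                    "on": { "value": "@item().SourceName", "type": "Expression" },
                    "cases": [
                      { "value": "SourceA", "activities": [] },
                      { "value": "SourceB", "activities": [] }
                    ],
                    "defaultActivities": []
                  }
                }
              ]
            }
          }
        ]
      }
    }

The second ForEach for the Processed layer would follow the same Lookup-driven pattern. If nesting restrictions in your Data Factory prevent placing the Switch directly inside the ForEach, the usual workaround is to move it into a child pipeline invoked through an Execute Pipeline activity.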

Upvotes: 1
