Mohammad

Reputation: 1013

How to copy new files from a blob storage container to another container using Azure Data Factory?

I have an Azure blob storage and I want to copy files from "Container_Source" to "Container_Sink" with the below conditions:

I want to copy files starting with "Energy" and ending with ".zip". A sample filename is "Energy_Payment_20231209110007_0000000404988124.zip".

Also, for the first time copy, I want to copy all files.

For subsequent runs, I want to copy only new files, i.e. files that arrived after the last copy. In other words, I don't want to copy all "Energy .... .zip" files every time I run the pipeline.

Is there any way to achieve this goal in Azure Data Factory?

Upvotes: 0

Views: 260

Answers (1)

Rakesh Govindula

Reputation: 11549

For this, you need two extra temporary files and one extra copy activity after the original copy activity.

For the first run, copy all the files using this method and debug the pipeline.

For the newly loaded files:

  • Take two CSV files. One is a blank file with one dummy column and one row; the other should contain a column curr_date whose single row value is your last pipeline run time (see the sketch after this list).


  • You need to set this value manually for the first run. For subsequent runs, the extra copy activity will update it automatically.

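For illustration, the two files could look like this (the names date.csv and blank.csv are just assumed for this example, not required by ADF):

date.csv

    curr_date
    2023-12-09T11:00:07Z

blank.csv

    dummy
    1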

Create CSV datasets for these two files in ADF and use the date CSV dataset as the source of a Lookup activity.
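A trimmed Lookup activity definition might look roughly like this (the dataset name DateCsvDataset is an assumption for this sketch):

    {
        "name": "Lookup1",
        "type": "Lookup",
        "typeProperties": {
            "source": { "type": "DelimitedTextSource" },
            "dataset": {
                "referenceName": "DateCsvDataset",
                "type": "DatasetReference"
            },
            "firstRowOnly": true
        }
    }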


The lookup returns the curr_date column from the file above. Use this value in the original copy activity's source settings under Filter by last modified (start time), like below.

@string(activity('Lookup1').output.firstRow.curr_date)


Give the wildcard file path for the required files (for example, Energy*.zip). Use your target binary dataset as the sink dataset.
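Put together, the source and sink of the original copy activity could look roughly like this sketch, assuming binary datasets pointing at Container_Source and Container_Sink:

    {
        "source": {
            "type": "BinarySource",
            "storeSettings": {
                "type": "AzureBlobStorageReadSettings",
                "recursive": true,
                "wildcardFileName": "Energy*.zip",
                "modifiedDatetimeStart": {
                    "value": "@string(activity('Lookup1').output.firstRow.curr_date)",
                    "type": "Expression"
                }
            }
        },
        "sink": {
            "type": "BinarySink",
            "storeSettings": { "type": "AzureBlobStorageWriteSettings" }
        }
    }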

After this, take another copy activity and give the blank CSV dataset as its source dataset. Add an Additional column named curr_date in the source settings and give @utcNow() as its dynamic content.
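In JSON, that additional column on the second copy activity's source could be expressed roughly like this (a sketch; the column name matches the curr_date header of the date file):

    {
        "source": {
            "type": "DelimitedTextSource",
            "additionalColumns": [
                {
                    "name": "curr_date",
                    "value": {
                        "value": "@utcNow()",
                        "type": "Expression"
                    }
                }
            ]
        }
    }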


Give the date CSV dataset as the sink dataset for this copy activity.


So, after every pipeline run, the second copy activity updates the date CSV file with the last pipeline run time. On the next run, the lookup activity reads this time and the original copy activity filters for only the new files loaded after the last run.
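The whole pipeline is then a simple chain of three activities. A trimmed outline (activity names here are illustrative) might look like this:

    {
        "name": "CopyNewEnergyFiles",
        "properties": {
            "activities": [
                { "name": "Lookup1", "type": "Lookup" },
                {
                    "name": "CopyNewFiles",
                    "type": "Copy",
                    "dependsOn": [
                        { "activity": "Lookup1", "dependencyConditions": [ "Succeeded" ] }
                    ]
                },
                {
                    "name": "UpdateLastRunDate",
                    "type": "Copy",
                    "dependsOn": [
                        { "activity": "CopyNewFiles", "dependencyConditions": [ "Succeeded" ] }
                    ]
                }
            ]
        }
    }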

In my test, the source container had both old files and newly loaded files.


After the run, only the newly loaded files were copied to my target location.


Upvotes: 1
