Reputation: 1013
I have an Azure Blob Storage account and I want to copy files from "Container_Source" to "Container_Sink" under the conditions below:
I want to copy files starting with "Energy", and ending with ".zip". Sample filename could be "Energy_Payment_20231209110007_0000000404988124.zip"
Also, for the first time copy, I want to copy all files.
For next runs, I want to copy only new files, files that arrived after the last copy. In fact, I don't want to copy all "Energy .... .zip" files every time I run the pipeline.
Is there any way to achieve this goal in Azure Data Factory?
Upvotes: 0
Views: 260
Reputation: 11549
For this, you need two extra temporary files and one extra copy activity after the original copy activity.
For the first run, use this method to copy all the files and debug the pipeline.
For the newly loaded files,
Take two csv files. One is a blank file with one dummy column and one row; the other csv file should contain a column curr_date whose row value is your last pipeline run time.
You need to set this value manually for the first run. On subsequent runs, the extra copy activity will update it automatically.
Create csv datasets for these two files in ADF and give the date csv dataset to a lookup activity. The lookup returns the curr_date column from that file; use its value in the original copy activity's source settings under Filter by last modified, like below.
@string(activity('Lookup1').output.firstRow.curr_date)
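The effect of that source-side filter can be sketched in plain Python. This is a minimal sketch with a hypothetical in-memory blob listing, not the copy activity's actual internals: keep only blobs whose last-modified time is later than the curr_date read by the lookup.

```python
from datetime import datetime, timezone

# Hypothetical blob listing: (name, last_modified) pairs standing in for
# what the copy activity sees in the source container.
blobs = [
    ("Energy_Payment_20231209110007_0000000404988124.zip",
     datetime(2023, 12, 9, 11, 0, 7, tzinfo=timezone.utc)),
    ("Energy_Refund_20231201080000_0000000404980000.zip",
     datetime(2023, 12, 1, 8, 0, 0, tzinfo=timezone.utc)),
]

# curr_date as the lookup activity would read it from the date csv
curr_date = datetime(2023, 12, 5, 0, 0, 0, tzinfo=timezone.utc)

# "Filter by last modified": keep only blobs modified after the last run
new_blobs = [name for name, modified in blobs if modified > curr_date]
print(new_blobs)
```

Only the December 9 file survives the filter, because the December 1 file predates the stored run time.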
Give the wildcard file path for the required files (here, Energy*.zip) and use your target binary dataset as the sink dataset.
After this, take another copy activity with the blank csv dataset as its source. Use Additional column in the source settings and give @utcNow() as the dynamic content.
Give the date dataset as the sink dataset for this copy activity.
So, after every pipeline run, the second copy activity updates the date csv file with the last pipeline run time. On the next run, the lookup activity reads this time and the source filter copies only the files loaded after the last run.
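The file the second copy activity produces is just a one-column csv: a curr_date header and the run timestamp. A minimal Python sketch of that file's shape, assuming an ISO 8601 timestamp like the one ADF's utcNow() returns:

```python
import csv
import io
from datetime import datetime, timezone

# Stand-in for @utcNow(): current UTC time in ISO 8601 form
now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

# Build the date csv in memory: one header row, one value row
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["curr_date"])  # header the lookup activity expects
writer.writerow([now])
print(buf.getvalue())
```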
These are my source files, including the newly loaded files:
You can see that only the newly loaded files were copied to my target location.
Upvotes: 1