Reputation: 11
I have an Azure Data Factory requirement. There are 50 csv files and each file is named like Product, Department, Employee, Sales, etc. Each of these files has a unique number of columns. In Azure SQL Database, I have 50 tables like Product, Department, Employee, Sales, etc. The columns of each table match with its corresponding file. Every day, I receive a new set of files in an Azure Data Lake Storage Gen 2 folder at 11 PM CST. At 12:05 AM CST, each of these files should be loaded into its respective table.
There can be a single pipeline, or two pipelines where the parent pipeline collects the file metadata and passes it to a child pipeline that does the data load. The pipeline should find the files with the previous day's timestamp, loop through them, and load each one into its respective target table, one by one. Can someone briefly explain the Activities and Transformations I need to use to fulfil this requirement?
I am new to ADF. I haven't tried anything so far.
Upvotes: 0
Views: 124
Reputation: 11539
Each of these files has a unique number of columns. In Azure SQL Database, I have 50 tables like Product, Department, Employee, Sales, etc. The columns of each table match with its corresponding file.
Since the files and tables share the same names and the same columns, the process below will work as long as the source and target schemas match.
First, use a Get Metadata activity on the source folder to get the list of files.
To pick up only the newly uploaded files, use the Filter by last modified option in the Get Metadata activity. This option only accepts UTC, and CST is UTC-6, so cross-check both time zones and set the Start time and End time accordingly, using the appropriate date/time functions.
For example, you can build the Start time and End time with the built-in date functions.
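As a rough sketch (the two-hour lookback window here is an assumption; adjust it to when your files actually land):

Start time:  @addHours(utcNow(), -2)
End time:    @utcNow()

When the pipeline fires at 12:05 AM CST, this window covers files modified from about 10 PM CST onward, which includes the 11 PM CST arrival.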
The Get Metadata activity returns the matching files as a childItems array.
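For illustration, with files named as in the question, the childItems array would look something like this:

[
    { "name": "Product.csv", "type": "File" },
    { "name": "Department.csv", "type": "File" },
    { "name": "Employee.csv", "type": "File" }
]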
Pass this childItems array, @activity('Get Metadata1').output.childItems, to a ForEach activity. Inside the ForEach, use a Copy activity to copy one file per iteration.
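Under the hood, the ForEach Items setting is just this expression; a sketch of the relevant part of the activity JSON, assuming the Get Metadata activity is named Get Metadata1:

"items": {
    "value": "@activity('Get Metadata1').output.childItems",
    "type": "Expression"
}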
Create another source dataset with a dataset parameter (here, sourcefilename) and use that parameter for the file name. Use this dataset as the Copy activity source and assign @item().name as the parameter value.
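For instance, in the source dataset the file name field references the parameter, and the Copy activity source fills it per iteration (the container and folder names below are placeholders):

Dataset file path:  yourcontainer / input / @dataset().sourcefilename
Copy source dataset properties:  sourcefilename = @item().name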
For the sink, create an Azure SQL Database dataset with two dataset parameters, schema and table_name, and use them for the table's schema and name. To get the table name, strip the '.csv' extension from @item().name using split and pass the result to the table_name parameter:

@split(item().name, '.csv')[0]
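Putting the sink together, the dataset fields reference the parameters and the Copy activity sink supplies the values on each iteration (dbo is an assumed schema name; adjust it if your tables live in another schema):

Sink dataset table:  Schema = @dataset().schema, Table = @dataset().table_name
Copy sink dataset properties:  schema = dbo, table_name = @split(item().name, '.csv')[0]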
Now, schedule this pipeline at 12:05 AM in your time zone (CST) using a schedule trigger.
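A minimal sketch of the trigger recurrence, assuming a daily run at 12:05 AM Central time (the startTime date is a placeholder):

"recurrence": {
    "frequency": "Day",
    "interval": 1,
    "startTime": "2024-01-01T00:05:00",
    "timeZone": "Central Standard Time",
    "schedule": { "hours": [ 0 ], "minutes": [ 5 ] }
}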
Upvotes: 1