Reputation: 393
I'm trying to solve following scenario in Azure Data Factory:
I have a large number of folders in Azure Blob Storage. Each folder contains varying number of files in parquet format. Folder name contains the date when data contained in the folder was generated, something like this: DATE=2021-01-01. I need to filter the files and save them into another container in delimited format and each file should have the date indicated in source folder name in it's file name.
So when my input looks something like this...
DATE=2021-01-01/
data-file-001.parquet
data-file-002.parquet
data-file-003.parquet
DATE=2021-01-02/
data-file-001.parquet
data-file-002.parquet
...my output should look something like this:
output-data/
data_2021-01-01_1.csv
data_2021-01-01_2.csv
data_2021-01-01_3.csv
data_2021-01-02_1.csv
data_2021-01-02_2.csv
Reading files from subfolders and filtering them and saving them is easy. Problems start when I'm trying to set output dataset file name dynamically. I can get the folder names using Get Metadata activity and then I can use ForEach activity to set them into variables. However, I can't figure out how to use this variable in filtering data flow sinks dataset.
Upvotes: 1
Views: 1710
Reputation: 6043
Update:
My Get Metadata1 activity, set the container input
as:
Set the container input
as follows:
My debug info is as follows:
I think I've found the solution. I'm using csv files for example.
My input looks something like this
container:input
2021-01-01/
data-file-001.csv
data-file-002.csv
data-file-003.csv
2021-01-02/
data-file-001.csv
data-file-002.csv
My debug result is as follows:
Using Get Metadata1
activity to get the folder list and then using ForEach1
activity to iterate this list.
Inside the ForEach1
activity, we now using data flow to move data.
Set the source dataset to the container and declare a parameter FolderName
.
Then add dynamic content @dataset().FolderName
to the source dataser.
Back to the ForEach1
activity, we can add dynamic content @item().name
to parameter FolderName
.
Key in File_Name
to the tab. It will store the file name as a column eg. /2021-01-01/data-file-001.csv
.
Then we can process this column to get the file name we want via DerivedColumn1
.
Addd expression concat('data_',substring(File_Name,2,10),'_',split(File_Name,'-')[5])
.
In the Settings of sink, we can select Name file as column data
and File_Name
.
That's all.
Upvotes: 2