Kilipukki
Kilipukki

Reputation: 393

Azure Data Factory: output dataset file name from input dataset folder name

I'm trying to solve following scenario in Azure Data Factory:

I have a large number of folders in Azure Blob Storage. Each folder contains varying number of files in parquet format. Folder name contains the date when data contained in the folder was generated, something like this: DATE=2021-01-01. I need to filter the files and save them into another container in delimited format and each file should have the date indicated in source folder name in it's file name.

So when my input looks something like this...

DATE=2021-01-01/
    data-file-001.parquet
    data-file-002.parquet
    data-file-003.parquet
DATE=2021-01-02/
    data-file-001.parquet
    data-file-002.parquet

...my output should look something like this:

output-data/
    data_2021-01-01_1.csv
    data_2021-01-01_2.csv
    data_2021-01-01_3.csv
    data_2021-01-02_1.csv
    data_2021-01-02_2.csv

Reading files from subfolders and filtering them and saving them is easy. Problems start when I'm trying to set output dataset file name dynamically. I can get the folder names using Get Metadata activity and then I can use ForEach activity to set them into variables. However, I can't figure out how to use this variable in filtering data flow sinks dataset.

Upvotes: 1

Views: 1710

Answers (1)

Joseph  Xu
Joseph Xu

Reputation: 6043

Update:
My Get Metadata1 activity, set the container input as:
enter image description here Set the container input as follows:
enter image description here My debug info is as follows: enter image description here

I think I've found the solution. I'm using csv files for example.
My input looks something like this

container:input 
    2021-01-01/
      data-file-001.csv
      data-file-002.csv
      data-file-003.csv
    2021-01-02/
      data-file-001.csv
      data-file-002.csv

My debug result is as follows:
enter image description here

enter image description here

  1. Using Get Metadata1 activity to get the folder list and then using ForEach1 activity to iterate this list.
    enter image description here

  2. Inside the ForEach1 activity, we now using data flow to move data.
    enter image description here

  3. Set the source dataset to the container and declare a parameter FolderName. enter image description here

  4. Then add dynamic content @dataset().FolderName to the source dataser. enter image description here

  5. Back to the ForEach1 activity, we can add dynamic content @item().name to parameter FolderName. enter image description here

  6. Key in File_Name to the tab. It will store the file name as a column eg. /2021-01-01/data-file-001.csv. enter image description here

  7. Then we can process this column to get the file name we want via DerivedColumn1. Addd expression concat('data_',substring(File_Name,2,10),'_',split(File_Name,'-')[5]). enter image description here

  8. In the Settings of sink, we can select Name file as column data and File_Name. enter image description here

That's all.

Upvotes: 2

Related Questions