Gregor G.

Reputation: 77

Get metadata from Blob storage with "folder like structure" using Azure Data Factory pipeline

I will get straight to the point. This is the problem:

I have an Azure storage account with Blob storage in which I have multiple containers. In these containers I have a "folder-like structure" made up of directories and subdirectories (I guess this is the proper terminology for it, because the dataset has a "Directory" field right after the container, as you can see in the picture). Screenshot from Azure with Container -> Directory -> File.

The structure is the following (for simplicity I will make it shorter, but still representative): Structure of data storage - Container -> Years -> Months -> CSV docs.
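For example, a few concrete paths in this layout might look like this (made-up names, just to illustrate):

salescontainer/2021/01/report_a.csv
salescontainer/2021/02/report_b.csv
salescontainer/2022/01/report_c.csv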

I need to get metadata from the CSV files (particularly the name of each file) so I can add additional logic to the pipeline that decides which files to copy. What is the best solution to get these filenames?

I have tried to use a ForEach activity. First I created a dataset where I only specified the container name and used it in a Get Metadata activity, which gave me a list of years as output (I listed the child items). Then I created another, parameterized dataset where I defined the directory as @dataset().FileName (I did not define the file name). I used this dataset in a ForEach loop with a Get Metadata activity and was able to get the list of month numbers, as in the file structure above. Then I went on to create a third dataset (I already suspected this was the wrong approach, but I gave it a shot) where I wanted to concatenate two parameters in the directory field. Here I found out that I could not use the parameter of a previous dataset in another dataset. So I thought maybe I could use a variable, but that did not work either, because I got an error every time I tried to use the variable in "Add dynamic content". Then I tried a dataset where I defined only the container and the file name, but I ended up getting results only for the default value set for the file name at the dataset level.

Current structure of pipeline in ADF

Since I am quite new to ADF and to creating pipelines, I wonder what I am missing. What would be your proposed solution to get the file names of the CSV docs so I can use them later on within the pipeline?

Upvotes: 0

Views: 4622

Answers (1)

NiharikaMoola

Reputation: 5074

I have reproduced this by iterating through multiple sub-folders using ForEach activities together with Execute Pipeline activities.

Source dataset:

Create a dataset for the source and add a dataset parameter so the directory value can be passed in dynamically.
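As a rough sketch, a parameterized source dataset could look something like the following JSON (the container, linked service, and parameter names are placeholders I chose for illustration; the key part is the folderPath bound to the dataset parameter):

{
    "name": "SourceCsvDataset",
    "properties": {
        "linkedServiceName": {
            "referenceName": "AzureBlobStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "DirectoryName": { "type": "string" }
        },
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "salescontainer",
                "folderPath": {
                    "value": "@dataset().DirectoryName",
                    "type": "Expression"
                }
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}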


Main pipeline:

  1. Using the Get Metadata activity, get the folders inside the given container.


  2. Pass the child items to the ForEach activity. Inside the ForEach, add an Execute Pipeline activity to call another pipeline that gets the subfolders of the current item (@item().name). A rough JSON sketch of these two activities is shown below.

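Roughly, these two steps correspond to something like the following fragment of the main pipeline's activities array (activity, dataset, and pipeline names are placeholders; I assume the parameterized dataset is reused with an empty directory value for the container root, but a separate dataset fixed to the container root works just as well):

{
    "name": "Get Metadata folders",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": {
            "referenceName": "SourceCsvDataset",
            "type": "DatasetReference",
            "parameters": { "DirectoryName": "" }
        },
        "fieldList": [ "childItems" ]
    }
},
{
    "name": "ForEach folder",
    "type": "ForEach",
    "dependsOn": [ { "activity": "Get Metadata folders", "dependencyConditions": [ "Succeeded" ] } ],
    "typeProperties": {
        "items": {
            "value": "@activity('Get Metadata folders').output.childItems",
            "type": "Expression"
        },
        "activities": [
            {
                "name": "Execute child pipeline1",
                "type": "ExecutePipeline",
                "typeProperties": {
                    "pipeline": { "referenceName": "ChildPipeline1", "type": "PipelineReference" },
                    "parameters": {
                        "dir1": { "value": "@item().name", "type": "Expression" }
                    },
                    "waitOnCompletion": true
                }
            }
        ]
    }
}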

Child pipeline1 (to get the subfolders):

  1. In the child pipeline, create a pipeline parameter to get the current item name (main folder name) from the parent pipeline.


  2. Using the Get Metadata activity, get the list of subfolders. Pass the pipeline parameter to the dataset parameter (see the sketch after this section).

Dataset property value: @concat(pipeline().parameters.dir1,'/')


  3. Pass the child items to a ForEach activity. Inside the ForEach, you can use a Filter activity to filter the subfolder names if required. Then pass the required current item to an Execute Pipeline activity that calls the second child pipeline (which gets the files from each subfolder). Pass the child pipeline parameter value from here:

@concat(pipeline().parameters.dir1,'/',item().name,'/')

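Put together, the parameter and the Get Metadata step of child pipeline1 could look roughly like this (names are again placeholders; the ForEach and Execute Pipeline that follow use the same pattern as in the main pipeline, with the concatenated path above as the parameter value):

{
    "name": "ChildPipeline1",
    "properties": {
        "parameters": {
            "dir1": { "type": "string" }
        },
        "activities": [
            {
                "name": "Get Metadata subfolders",
                "type": "GetMetadata",
                "typeProperties": {
                    "dataset": {
                        "referenceName": "SourceCsvDataset",
                        "type": "DatasetReference",
                        "parameters": {
                            "DirectoryName": {
                                "value": "@concat(pipeline().parameters.dir1,'/')",
                                "type": "Expression"
                            }
                        }
                    },
                    "fieldList": [ "childItems" ]
                }
            }
        ]
    }
}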

Child pipeline2 (gets the files and processes them):

  1. Create the pipeline parameter to get the value from its parent pipeline.


  2. Using the Get Metadata activity, get the files from each subfolder by passing the parameter value to the dataset parameter.


  3. Pass the output child items to a ForEach activity. Inside the ForEach, you can use a Filter activity to keep only the required files (a sample Filter condition is sketched at the end of this answer).

  4. Use a Copy Data activity to copy the required files to the sink.

Dataset properties:

Dir - @concat(pipeline().parameters.path,'/',item().name)

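As a small addition, the Filter activity in child pipeline2 (step 3 above) could keep only CSV files with settings along these lines (the activity name and the extension check are my assumptions; adjust them to your files):

Items: @activity('Get Metadata files').output.childItems
Condition: @and(equals(item().type, 'File'), endswith(item().name, '.csv'))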

Upvotes: 1
