Mariah Akinbi
Mariah Akinbi

Reputation: 386

Get Last Modified Date on Partitioned Data Using Azure Data Factory

I am ultimately trying to get the names of the partitions (ie DATE_ID=20211004) that have been modified within the last 24 hours. I've tried using the Get Metadata activity, but it seems to only iterate partition folders, and not the files within the partitioned folders. I need to get the last modified date of the partitioned files within the folders, but I'm not sure how to do that.

Folder structure:

product_data/

|---------DATE_ID=20211002/

|---------|---------part-00011-tid-345678900-abc123-6793-1.c000.snappy.parquet

|---------|---------committed_123456789

|---------DATE_ID=20211003/

|---------|---------part-00086-tid-345678900-abc123-6756-2.c000.snappy.parquet

|---------|---------committed_123456789

|---------DATE_ID=20211004/

|---------|---------part-00042-tid-345678900-abc123-6712-1.c000.snappy.parquet

|---------|---------committed_123456789

In the example above, I'd like to get the last modified date and full name of each of the parquet files.

I've tried a variety of For Each + get Metadata activity combinations such as the one below:

https://ibb.co/B4vMF6j

Upvotes: 0

Views: 6030

Answers (1)

KarthikBhyresh-MT
KarthikBhyresh-MT

Reputation: 5034

You would have to set Start time (UTC) to @getPastTime(24,'Hour') and leave the End time (UTC) as NULL to fetch all the files modified within last 24hrs from current timestamp. Using this would only output childitems, that are files and not folders. This is by design.

Create source dataset with path being the root for partitioned data. Use Get Metadata activity to list the files in that folder.

enter image description here

Assign the output list of files to an array variable:

@activity('Get Metadata').output.childItems

enter image description here

Next, you can use the meta data option lastModified to get the Last modified datetime of each file or folder.

Inside foreach activity @variables('items'), create a clone of source file and add a parameter to dynamically get the files based on each file we receive inside foreach.

enter image description here

enter image description here

Inside foreach activity:

enter image description here

We then append these lastmodified timestamp received to another variable (you can customize this as per your need)

 @concat(item().name, '  ', 'LastModified:' , activity('Get Metadata1').output.lastModified)

enter image description here

Finally to list all the results, store the result from appended variable to result variable named final.

   {
    "name": "final",
    "value": [
        "SourceAAD.csv  LastModified:2021-09-29T13:15:07Z",
        "mysource.csv  LastModified:2021-09-27T11:56:33Z",
        "new adf.xlsx  LastModified:2021-09-29T13:16:01Z",
        "sourceDataTime.parquet  LastModified:2021-09-13T05:42:43Z",
        "table3_CustomerAddress.csv  LastModified:2021-10-04T06:01:12Z",
        "table4_product.csv  LastModified:2021-10-04T06:01:13Z",
        "table5_ProductCategory.csv  LastModified:2021-10-04T06:01:12Z",
        "table6_ProductDescription.csv  LastModified:2021-10-04T06:01:12Z",
        "testSource.json  LastModified:2021-08-30T04:19:54Z",
        "testSource.txt  LastModified:2021-08-30T04:17:56Z",
        "userdata2.parquet  LastModified:2021-09-13T03:57:27Z",
        "userdataCopy.parquet  LastModified:2021-09-13T03:57:28Z",
        "userdataOriginal.parquet  LastModified:2021-09-13T03:56:51Z"
    ]
}

enter image description here

Upvotes: 1

Related Questions