Reputation: 946
To be clear about the format, this is how the DataFrame is saved in Databricks:
folderpath = "abfss://<container>@<storage-account>.dfs.core.windows.net/folder/path"
df.write.format("delta").mode("overwrite").save(folderpath)
This produces a set of Parquet files (often in 2-4 chunks) in the main folder, plus a _delta_log folder containing the transaction log files that describe each write. The _delta_log folder dictates which set of Parquet files in the folder should be read as the current version.
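For reference, the saved folder ends up looking roughly like this (file names are illustrative):

folder/path/
    part-00000-<uuid>.c000.snappy.parquet
    part-00001-<uuid>.c000.snappy.parquet
    _delta_log/
        00000000000000000000.json
        00000000000000000001.json

Each numbered JSON commit file in _delta_log holds one action per line; "add" actions name the Parquet files that belong to that version and "remove" actions retire files from earlier versions, e.g. (abbreviated):

{"add":{"path":"part-00001-<uuid>.c000.snappy.parquet","size":1234,"modificationTime":1600000000000,"dataChange":true}}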
In Databricks, I would read the latest dataset, for example, by doing the following:
df = spark.read.format("delta").load(folderpath)
How would I do this in Azure Data Factory? I have chosen Azure Data Lake Storage Gen2 and then the Parquet format, but this doesn't seem to work: the entire set of Parquet files is read (i.e. every version of the data), not just the latest.
How can i set this up properly?
Upvotes: 1
Views: 1785
Reputation: 16411
It seems hard to achieve this with a Data Factory pipeline alone, but I have some ideas for you:
1. Use a Lookup activity to get the content of the _delta_log commit files. If there are many files, use a Get Metadata activity to get each file's metadata (e.g. the last modified date).
2. Use an If Condition or Switch activity to filter for the latest data.
3. Once the data is filtered, pass the Lookup output to the Copy activity source (set it as a parameter).
The hardest part is figuring out how to identify the latest dataset from the _delta_log. You could try it this way; the whole workflow should look like the above, but I can't promise it works, as I can't test it without the same environment.
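As a rough sketch of that filtering step, assuming you can run Python somewhere with file access to the table (e.g. a mounted path in a Databricks notebook or an Azure Function called from the pipeline), something like the following could work out which Parquet files are current. It replays the JSON commits and ignores checkpoint files, so it is only illustrative, not a full Delta reader:

import json
import os

def active_parquet_files(table_path):
    # Replay the Delta transaction log in order: "add" actions register
    # Parquet files, "remove" actions retire them. What is left is the
    # set of files that make up the latest version of the table.
    log_dir = os.path.join(table_path, "_delta_log")
    commits = sorted(f for f in os.listdir(log_dir) if f.endswith(".json"))
    files = set()
    for commit in commits:
        with open(os.path.join(log_dir, commit)) as fh:
            for line in fh:
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"]["path"])
                elif "remove" in action:
                    files.discard(action["remove"]["path"])
    return sorted(files)

# Hypothetical mount point; the returned relative paths could then be passed
# to the Copy activity source as a parameter.
# print(active_parquet_files("/dbfs/mnt/container/folder/path"))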
HTH.
Upvotes: 1