Reputation: 946
To be clear about the format, this is how the DataFrame is saved in Databricks:
folderpath = "abfss://<container>@<storage-account>.dfs.core.windows.net/folder/path"
df.write.format("delta").mode("overwrite").save(folderpath)
This produces a set of Parquet files (often in 2-4 chunks) in the main folder, plus a _delta_log folder containing the transaction log files that describe each write. The _delta_log folder dictates which set of Parquet files in the folder should be read as the current version.
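For reference, the saved folder ends up looking roughly like this (file names are illustrative):

folder/path/
    part-00000-<uuid>.c000.snappy.parquet
    part-00001-<uuid>.c000.snappy.parquet
    _delta_log/
        00000000000000000000.json
        00000000000000000001.json

Each numbered JSON commit file in _delta_log holds one action per line; "add" actions name the Parquet files that belong to that version and "remove" actions retire files from earlier versions, e.g. (abbreviated):

{"add":{"path":"part-00001-<uuid>.c000.snappy.parquet","size":1234,"modificationTime":1600000000000,"dataChange":true}}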
In Databricks, I would read the latest dataset, for example, by doing the following:
df = spark.read.format("delta").load(folderpath)
How would I do this in Azure Data Factory? I have chosen Azure Data Lake Storage Gen2 and then the Parquet format, but this doesn't seem to work: the entire set of Parquet files is read (i.e. every version of the data), not just the latest.
How can i set this up properly?
Upvotes: 1
Views: 1785
Reputation: 16411
It seems hard to achieve this with a Data Factory pipeline alone, but I have some ideas for you:
1. Use a Lookup activity to get the content of the _delta_log commit files. If there are many files, use a Get Metadata activity to get each file's metadata (e.g. the last modified date).
2. Use an If Condition or Switch activity to filter for the latest data.
3. Once the data is filtered, pass the Lookup output to the Copy activity source (set it as a parameter).
The hardest part is figuring out how to identify the latest dataset from the _delta_log. You could try it this way; the whole workflow should look like the above, but I can't promise it works, as I can't test it without the same environment.
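As a rough sketch of that filtering step, assuming you can run Python somewhere with file access to the table (e.g. a mounted path in a Databricks notebook or an Azure Function called from the pipeline), something like the following could work out which Parquet files are current. It replays the JSON commits and ignores checkpoint files, so it is only illustrative, not a full Delta reader:

import json
import os

def active_parquet_files(table_path):
    # Replay the Delta transaction log in order: "add" actions register
    # Parquet files, "remove" actions retire them. What is left is the
    # set of files that make up the latest version of the table.
    log_dir = os.path.join(table_path, "_delta_log")
    commits = sorted(f for f in os.listdir(log_dir) if f.endswith(".json"))
    files = set()
    for commit in commits:
        with open(os.path.join(log_dir, commit)) as fh:
            for line in fh:
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"]["path"])
                elif "remove" in action:
                    files.discard(action["remove"]["path"])
    return sorted(files)

# Hypothetical mount point; the returned relative paths could then be passed
# to the Copy activity source as a parameter.
# print(active_parquet_files("/dbfs/mnt/container/folder/path"))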
HTH.
Upvotes: 1