Reputation: 323
We were able to read the files by specifying the Delta file source as a Parquet dataset in ADF. Although this reads the Delta files, it ends up reading all versions/snapshots of the data in the Delta directory instead of specifically picking up the most recent version of the data.
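For context, here is a minimal PySpark sketch of the behaviour described (runnable in a Databricks notebook where `spark` is predefined; the path is a placeholder):

```python
# A Delta directory holds ALL data files ever written plus a _delta_log folder.
# A plain Parquet read ignores the log, so it also returns superseded files --
# this is the "all versions/snapshots" effect described above.
all_versions = spark.read.parquet(
    "abfss://data@myaccount.dfs.core.windows.net/sales_delta")

# A Delta-aware read consults _delta_log and returns only the current snapshot.
latest_only = spark.read.format("delta").load(
    "abfss://data@myaccount.dfs.core.windows.net/sales_delta")
```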
There is a similar question here - Is it possible to connect to databricks deltalake tables from adf
However, I am looking to read the delta file from an ADLS Gen2 location. Appreciate any guidance on this.
Upvotes: 2
Views: 7649
Reputation: 5575
Time has passed, and Delta support for ADF Data Flow is now in preview... hopefully it makes it into native ADF soon. https://learn.microsoft.com/en-us/azure/data-factory/format-delta
Upvotes: 2
Reputation: 81
I don't think you can do it as easily as reading plain Parquet files today, because a Delta Lake directory is basically a transaction log plus data snapshots in Parquet format. Unless you VACUUM every time before you read from a Delta Lake directory, you are going to end up reading the old snapshot data as well, as you have observed.
Delta Lake files do not play very nicely OUTSIDE OF Databricks.
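For reference, a hedged sketch of the VACUUM step mentioned above (Databricks notebook, `spark` predefined; the path is a placeholder). Be aware that vacuuming with a zero retention window destroys time-travel history and requires disabling a safety check:

```python
from delta.tables import DeltaTable

table = DeltaTable.forPath(
    spark, "abfss://data@myaccount.dfs.core.windows.net/sales_delta")

# Default retention is 7 days (168 hours); files younger than this are kept,
# so old snapshots may still linger after this call.
table.vacuum(168)

# Removing everything except the current snapshot means opting out of the
# retention safety check -- only do this if you don't need time travel.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
table.vacuum(0)
```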
In our data pipeline, we usually have a Databricks notebook that exports data from the Delta Lake format to regular Parquet format in a temporary location. We let ADF read those Parquet files and clean up once done. Depending on the size of your data and how you use it, this may or may not be an option for you.
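A minimal sketch of such an export notebook cell (paths are hypothetical; `spark` is predefined in Databricks notebooks):

```python
delta_path = "abfss://data@myaccount.dfs.core.windows.net/sales_delta"      # source Delta table
staging_path = "abfss://data@myaccount.dfs.core.windows.net/staging/sales"  # temp location for ADF

# Read the current snapshot via the Delta reader, then write plain Parquet
# that ADF's Parquet dataset can consume without seeing old versions.
(spark.read.format("delta").load(delta_path)
      .write.mode("overwrite")
      .parquet(staging_path))
```

After the ADF copy activity finishes, a Delete activity (or a follow-up notebook) can remove the staging folder.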
Upvotes: 5