teknik

Reputation: 151

Reading Databricks tables in Azure

Please clarify my confusion: I keep hearing that we need to read every Parquet file created by a Databricks Delta table to get to the latest data in the case of an SCD2 table. Is this true?

Can we simply use SQL to get the latest row? Can we use some date/time columns to get the history of changes to that row?

Thanks

Upvotes: 3

Views: 630

Answers (1)

Phuri Chalermkiatsakul

Reputation: 581

You can create and manage a Delta table as SCD2 if you wish. What is probably confusing you is the Time Travel feature: it simply allows you to roll back your Delta table to any of its historical states, which are automatically managed by Databricks.

Answer:

  • No. Instead, the reader looks at the Delta Lake transaction log, which is stored as JSON files inside the _delta_log folder alongside the Parquet files, to determine which Parquet files make up the latest snapshot. See this doc for more information.
  • Yes. We can use SQL to get the latest row; in fact, any SQL query against the table always returns the latest data by default. There are also two ways to query a historical version of the table with SQL: 1) specifying VERSION AS OF and 2) specifying TIMESTAMP AS OF. Read more.
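To make the first point concrete, here is a minimal, hypothetical sketch of how replaying the _delta_log resolves which Parquet files belong to a snapshot. This is not the actual Delta Lake implementation; the file names and log actions below are invented for illustration, and the real log format has more action types:

```python
# Simplified, hypothetical replay of a Delta transaction log.
# Each commit (one JSON file in _delta_log) is a list of actions;
# "add" registers a Parquet file, "remove" retires one.
commits = [
    # version 0: initial write
    [{"add": {"path": "part-000.parquet"}}],
    # version 1: an SCD2-style merge rewrites the file holding the changed row
    [{"remove": {"path": "part-000.parquet"}},
     {"add": {"path": "part-001.parquet"}},
     {"add": {"path": "part-002.parquet"}}],
]

def active_files(commits, as_of_version=None):
    """Return the set of Parquet files in the snapshot at a given version
    (latest if as_of_version is None) -- conceptually what VERSION AS OF does."""
    if as_of_version is None:
        as_of_version = len(commits) - 1
    files = set()
    for actions in commits[: as_of_version + 1]:
        for action in actions:
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files

print(sorted(active_files(commits)))      # latest snapshot
print(sorted(active_files(commits, 0)))   # "time travel" to version 0
```

Note that only the files in the latest snapshot are read for a normal query; older Parquet files stay on disk solely so that time travel (and VACUUM, until it removes them) can still reach them.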

PS This is a really good question.

Upvotes: 2
