Satish V.S

Reputation: 1

Databricks notebooks lineage in Azure Purview

If I read a file from ADLS into a PySpark DataFrame and write it back to another ADLS folder in a different file format, will that lineage be captured in the Hive metastore? Can lineage be shown for this kind of operation?
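For reference, this is the kind of operation being asked about. The helper name and the paths are placeholders, not from any particular workspace:

```python
# Minimal sketch of the operation in question. Assumes a Databricks/Spark
# environment where a SparkSession is available; ADLS paths are hypothetical.
def convert_adls_file(spark, src_path, dst_path):
    """Read a CSV from one ADLS folder and write it back as Parquet."""
    df = spark.read.option("header", "true").csv(src_path)
    df.write.mode("overwrite").parquet(dst_path)
```

On Databricks the `spark` session is predefined, and the paths would typically look like `abfss://container@account.dfs.core.windows.net/folder/file.csv`.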

Upvotes: 0

Views: 2077

Answers (2)

Will J

Reputation: 357

You can use the OpenLineage-based Databricks to Purview Solution Accelerator to ingest the lineage provided by Databricks. By deploying the solution accelerator, you'll have a set of Azure Functions and a Databricks cluster configuration that can extract the logical plan from a Databricks notebook or job and transform it automatically into Apache Atlas / Microsoft Purview entities.

  • Supports table level lineage from Spark Notebooks and jobs for the following data sources:
    • Azure SQL
    • Azure Synapse Analytics
    • Azure Data Lake Gen 2
    • Azure Blob Storage
    • Delta Lake
  • Supports Spark 3.1 and 3.0 (Interactive and Job clusters) / Spark 2.x (Job clusters)
  • Databricks Runtimes between 6.4 and 10.3 are currently supported
  • Can be configured per cluster or for all clusters as a global configuration
  • Once configured, does not require any code changes to notebooks or jobs
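For context, the cluster-level configuration amounts to a few Spark conf entries that point the OpenLineage listener at the deployed Azure Function. The keys below follow the OpenLineage Spark integration, but the placeholder values (and the exact set of keys) come from the accelerator's deployment output, so treat this as a sketch:

```
spark.extraListeners io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.host https://<your-function-app>.azurewebsites.net
spark.openlineage.namespace <your-namespace>
spark.openlineage.url.param.code <function-host-key>
```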

Upvotes: 2

alpinedelight

Reputation: 61

Currently this lineage won't show up out of the box. However, Purview uses Apache Atlas behind the scenes, so you can likely capture this lineage yourself using the API.

Here's an example of where Spline was used to track lineage from notebooks: https://intellishore.dk/data-lineage-from-databricks-to-azure-purview/

This article talks about how to get started with the Purview REST API: https://techcommunity.microsoft.com/t5/azure-architecture-blog/exploring-purview-s-rest-api-with-python/ba-p/2208058
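To make the API approach concrete, a custom lineage entry is essentially an Atlas "process" entity whose inputs and outputs reference the source and destination assets, posted to the catalog endpoint (`POST https://<account>.purview.azure.com/catalog/api/atlas/v2/entity`). The sketch below only builds the JSON payload; the type and attribute names are illustrative and should be checked against the type definitions in your Purview account:

```python
import json

# Hypothetical sketch: build an Atlas process entity linking a source ADLS
# path to a destination path. "azure_datalake_gen2_path" is one of Purview's
# built-in type names, but verify it against your account's type definitions.
def lineage_process_entity(name, source_qn, dest_qn):
    return {
        "entity": {
            "typeName": "Process",
            "attributes": {
                "name": name,
                "qualifiedName": f"custom://{name}",
                "inputs": [{"typeName": "azure_datalake_gen2_path",
                            "uniqueAttributes": {"qualifiedName": source_qn}}],
                "outputs": [{"typeName": "azure_datalake_gen2_path",
                             "uniqueAttributes": {"qualifiedName": dest_qn}}],
            },
        }
    }

payload = lineage_process_entity(
    "notebook_csv_to_parquet",
    "https://account.dfs.core.windows.net/raw/input.csv",
    "https://account.dfs.core.windows.net/curated/output.parquet",
)
print(json.dumps(payload, indent=2))
```

You would send this payload with a bearer token obtained from Azure AD (scope `https://purview.azure.net/.default`), as the linked article walks through.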

Upvotes: 2
