S.Khan

Reputation: 71

Accessing Delta Lake Table in Databricks via Spark in MLflow project

I am currently accessing a Delta Lake table from a Databricks notebook using Spark. However, I now need to access Delta tables from an MLflow project. The MLflow Spark API only allows logging and loading of Spark ML models. Any idea how I can accomplish this?

Currently I am trying to access Spark via this code in the MLflow project:


import pyspark

spark = pyspark.sql.SparkSession._instantiatedSession
if spark is None:
    # NB: If there is no existing Spark context, create a new local one.
    # NB: We're disabling caching on the new context since we do not need it and we want to
    # avoid overwriting the cache of the underlying Spark cluster when executed on a Spark
    # worker (e.g. as part of spark_udf).
    spark = (
        pyspark.sql.SparkSession.builder
        .config("spark.python.worker.reuse", True)
        .config("spark.databricks.io.cache.enabled", False)
        # In Spark 3.1 and above, this conf must be set explicitly to allow creating
        # a SparkSession on the workers.
        .config("spark.executor.allowSparkContext", "true")
        .master("local[*]")
        .appName("MLflow Project")
        .getOrCreate()
    )

But I am getting this error:

py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.

Upvotes: 1

Views: 832

Answers (2)

pradpalnis

Reputation: 11

Running MLflow projects within Databricks notebooks (e.g., against an existing interactive cluster from within a notebook attached to that cluster) isn't currently well supported, for several reasons, including the lack of auth propagation to the subprocess created to run the project.

Upvotes: 1

Alex Ott

Reputation: 87154

It should be done the same way as for normal Spark projects that don't run in notebooks:

  • Add the Delta dependency to spark-submit or pyspark; you also need to install the delta-spark package to get code completion, etc. (the --conf values could instead be set from code itself, see the next step):
pyspark --packages io.delta:delta-core_2.12:1.1.0 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
  • Create a SparkSession object and use it (a read example follows the snippet below):
import pyspark
from delta import *

builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()
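
Once the session is up, a Delta table can be read like any other Spark data source. A minimal sketch only; the path and table name below are placeholders I've made up, not anything from the question:

# Read a Delta table by path (placeholder path).
df = spark.read.format("delta").load("/tmp/delta/events")
df.show()

# Or query a table registered in the metastore (placeholder table name).
spark.sql("SELECT COUNT(*) FROM events").show()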

These steps are covered in the Quickstart guide in the Delta Lake documentation.

If the data resides on Azure Data Lake Storage, AWS S3, or GCP, you may need to add additional packages and configuration, but that's also covered in the documentation.
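
For S3, for example, one common approach is to add the hadoop-aws connector next to the Delta package and pass the s3a credentials as Spark confs. This is a rough sketch, not a complete recipe: the hadoop-aws version shown (3.3.1) is an assumption and must match the Hadoop version your Spark build ships with, and the credential placeholders are just that.

pyspark --packages io.delta:delta-core_2.12:1.1.0,org.apache.hadoop:hadoop-aws:3.3.1 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
  --conf "spark.hadoop.fs.s3a.access.key=<your-access-key>" \
  --conf "spark.hadoop.fs.s3a.secret.key=<your-secret-key>"

The session-creation code from the previous step then works unchanged; only the load path changes to an s3a:// URI.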

Upvotes: 1
