Reputation: 71
I am currently accessing a Delta Lake table from a Databricks notebook using Spark. However, now I need to access Delta tables from an MLflow project. The MLflow Spark API only allows logging and loading of SparkML models. Any idea on how I can accomplish this?
Currently I am trying to access Spark via this code in the MLflow project:
import pyspark

spark = pyspark.sql.SparkSession._instantiatedSession
if spark is None:
    # NB: If there is no existing Spark context, create a new local one.
    # NB: We're disabling caching on the new context since we do not need it and we want to
    # avoid overwriting the cache of the underlying Spark cluster when executed on a Spark
    # worker (e.g. as part of spark_udf).
    spark = (
        pyspark.sql.SparkSession.builder
        .config("spark.python.worker.reuse", True)
        .config("spark.databricks.io.cache.enabled", False)
        # In Spark 3.1 and above, we need to set this conf explicitly to enable creating
        # a SparkSession on the workers
        .config("spark.executor.allowSparkContext", "true")
        .master("local[*]")
        .appName("MLflow Project")
        .getOrCreate()
    )
But I am getting this error:
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
Upvotes: 1
Views: 832
Reputation: 11
Running MLflow projects within Databricks notebooks (e.g., against an existing interactive cluster from within a notebook attached to that cluster) isn’t currently well-supported for several reasons (including e.g., the lack of auth propagation to the subprocess created to run the project).
Upvotes: 1
Reputation: 87154
It should be done the same way as for normal Spark projects that don't run in notebooks:
1. If you start your code with spark-submit or pyspark, you need to install the delta-spark package to use code completion, etc. (the --conf values could also be set from the code itself, see the next step):
pyspark --packages io.delta:delta-core_2.12:1.1.0 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
2. Create the SparkSession object and use it:
import pyspark
from delta import *

builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()
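Once such a session exists, reading a Delta table inside the MLflow project is plain Spark code. Here is a minimal, self-contained sketch; the table path /tmp/delta/events and the table name my_db.events are placeholders, not something from the question:
import pyspark
from delta import configure_spark_with_delta_pip

# Same session setup as in step 2, with the Delta extensions enabled
builder = pyspark.sql.SparkSession.builder.appName("MLflow Project") \
    .master("local[*]") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Read a Delta table by path (placeholder location) ...
df = spark.read.format("delta").load("/tmp/delta/events")

# ... or by name, if the table is registered in the metastore (placeholder name)
# df = spark.table("my_db.events")

df.show()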
These steps are covered in the Quickstart guide in the Delta Lake documentation.
If the data resides on Azure Data Lake Storage, AWS S3, or GCP, you may need to add additional packages and configurations, but that's also covered in the documentation.
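For S3, for example, a sketch could look like the one below. The package versions are assumptions that must match your Spark/Hadoop build, the bucket path is a placeholder, and plain access/secret keys are only one of several supported credential mechanisms:
import pyspark

builder = (
    pyspark.sql.SparkSession.builder.appName("MyApp")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Pull in both Delta and the S3A filesystem; versions are assumptions, match them to your Spark/Hadoop build
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.1.0,org.apache.hadoop:hadoop-aws:3.3.1")
    # One possible way to pass credentials; instance profiles or credential providers work as well
    .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
    .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
)

spark = builder.getOrCreate()

# Placeholder bucket and path
df = spark.read.format("delta").load("s3a://my-bucket/delta/events")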
Upvotes: 1