Reputation: 228
What is the correct way to install the delta module in Python?
In the example they import the module:
from delta.tables import *
but I did not find the correct way to install the module in my virtual env.
Currently I am using this Spark param:
"spark.jars.packages": "io.delta:delta-core_2.11:0.5.0"
Upvotes: 14
Views: 35253
Reputation: 1839
Just install the libs:
!pip install pyspark
!pip install delta-spark
And then use them as you want:
from pyspark.sql import SparkSession
from delta import *
import os

# Make the Avro and Delta packages available to the PySpark shell
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-avro_2.12:3.4.1,io.delta:delta-core_2.12:2.4.0 pyspark-shell'

# spark = SparkSession.builder.appName("Basics").getOrCreate()
builder = SparkSession.builder.appName("Basics").master("local") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:0.8.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false") \
    .config("spark.databricks.delta.retentionDurationCheck.enabled", "false") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

# Extra Maven coordinates to resolve alongside the Delta package
my_packages = ["org.apache.hadoop:hadoop-aws:3.3.4",
               "org.apache.hadoop:hadoop-client-runtime:3.3.4",
               "org.apache.hadoop:hadoop-client-api:3.3.4",
               "io.delta:delta-contribs_2.12:3.0.0",
               "io.delta:delta-hive_2.12:3.0.0",
               "com.amazonaws:aws-java-sdk-bundle:1.12.603",
               ]

# Create a Spark instance with the builder
# As a result, you now can read and write Delta tables
spark = configure_spark_with_delta_pip(builder, extra_packages=my_packages).getOrCreate()
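To verify the setup, here is a small smoke test you could run afterwards (a minimal sketch; the output path is a hypothetical example):
# Write a tiny DataFrame in Delta format and read it back.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.format("delta").mode("overwrite").save("/tmp/delta-smoke-test")  # hypothetical path
spark.read.format("delta").load("/tmp/delta-smoke-test").show()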
Upvotes: 2
Reputation: 1
I was trying to pip install delta-spark in an environment created with python -m venv, and Pylance wasn't able to find the delta package when importing from delta.tables import *.
Switching from venv to virtualenv solved my problem. Just run pip install virtualenv, create a new environment, and then run pip install delta-spark, as sketched below.
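A minimal sketch of those steps, assuming a Unix shell and a hypothetical environment name venv-delta:
pip install virtualenv
virtualenv venv-delta
source venv-delta/bin/activate   # on Windows: venv-delta\Scripts\activate
pip install delta-spark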
Upvotes: 0
Reputation: 9507
If you are facing issues with a Jupyter notebook, add the environment variable below:
from pyspark.sql import SparkSession
import os
from delta import *
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-avro_2.12:3.4.1,io.delta:delta-core_2.12:2.4.0 pyspark-shell'
# RUN spark-shell --packages org.apache.spark:spark-avro_2.12:3.4.1
# RUN spark-shell --packages io.delta:delta-core_2.12:2.4.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
builder = SparkSession.builder.appName("SampleSpark") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = builder.getOrCreate()
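Once the session is up, a quick check that the Python API is usable (a hedged sketch; the path is a hypothetical example):
from delta.tables import DeltaTable

# Create a tiny Delta table and confirm the Python API sees it.
spark.range(3).write.format("delta").mode("overwrite").save("/tmp/delta-check")
print(DeltaTable.isDeltaTable(spark, "/tmp/delta-check"))   # expected: True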
Upvotes: 0
Reputation: 19328
Here's how you can install Delta Lake & PySpark with conda.
conda env create -f envs/mr-delta.yml
conda activate mr-delta
import pyspark
from delta import *
builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder).getOrCreate()
Upvotes: 1
Reputation: 191
To run Delta locally with PySpark, you need to follow the official documentation.
This works for me, but only when executing the script directly (python <script_file>), not with pytest or unittest.
To solve this problem, you need to add this environment variable:
PYSPARK_SUBMIT_ARGS='--packages io.delta:delta-core_2.12:1.0.0 pyspark-shell'
Use the Scala and Delta versions that match your case. With this environment variable set, I can run pytest or unittest via the CLI without any problem:
from unittest import TestCase

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession


class TestClass(TestCase):
    builder = SparkSession.builder.appName("MyApp") \
        .master("local[*]") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    def test_create_delta_table(self):
        self.spark.sql("""CREATE TABLE IF NOT EXISTS <tableName> (
                          <field1> <type1>)
                          USING DELTA""")
The function configure_spark_with_delta_pip appends a config option to the builder object, roughly:
.config("spark.jars.packages", "io.delta:delta-core_<scala_version>:<delta_version>")
Upvotes: 8
Reputation: 446
As the correct answer is hidden in the comments of the accepted solution, I thought I'd add it here.
You need to create your Spark session with some extra settings, and then you can import delta:
from pyspark.sql import SparkSession

spark_session = SparkSession.builder \
    .master("local") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:0.8.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()
from delta.tables import *
Annoyingly, your IDE will of course shout at you about this since the package isn't installed, and you will also be operating without autocomplete and type hints. I'm sure there's a workaround and I will update if I come across it.
The package itself is on their GitHub here, and the readme suggests you can pip install it, but that doesn't work. In theory you could clone it and install it manually.
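For completeness, once the import works, a minimal usage sketch (the table path is a hypothetical example):
# Write a small Delta table and load it back through the DeltaTable API.
spark_session.range(5).write.format("delta").mode("overwrite").save("/tmp/delta-example")
delta_table = DeltaTable.forPath(spark_session, "/tmp/delta-example")  # hypothetical path
delta_table.toDF().show()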
Upvotes: 11
Reputation: 781
In my case, the issue was that I had a cluster running on a Databricks Runtime lower than 6.1.
https://docs.databricks.com/delta/delta-update.html
The Python API is available in Databricks Runtime 6.1 and above.
After changing the Databricks Runtime to 6.4, the problem disappeared.
To do that: click Clusters -> pick the one you are using -> Edit -> pick Databricks Runtime 6.1 or above.
Upvotes: -1
Reputation: 20836
Because Delta's Python code is stored inside a jar and loaded by Spark, the delta module cannot be imported until the SparkSession/SparkContext is created.
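A sketch of that ordering constraint, assuming the jar-only setup from the question (no pip-installed delta-spark):
from pyspark.sql import SparkSession

# With only the Delta jar (no pip-installed delta-spark), importing delta here
# would fail, because the Python files ship inside a jar Spark has not loaded yet:
# from delta.tables import *   # too early

spark = SparkSession.builder \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:0.8.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

from delta.tables import *   # works once the session (and the jar) is loaded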
Upvotes: 7