Jaxer

Reputation: 37

Delta Lake for AWS Glue notebook setup

I would like to set up the Delta Lake format on AWS Glue and do a simple ETL finishing with df.write.format("delta").mode("overwrite").save(<s3 path>). Can someone please provide me with copy-paste code for this?
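Roughly, the end result I'm after would look like this minimal sketch (the source and the S3 path are just placeholders):

# Minimal ETL sketch: read something, apply a trivial transform, write out as a Delta table
df = spark.read.json("s3://my-bucket/input/")        # placeholder source
df_clean = df.dropDuplicates()                       # placeholder transform step
df_clean.write.format("delta").mode("overwrite").save("s3://my-bucket/delta/output")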

I'm using this:

{
  "--datalake-formats": "delta"
}
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
  
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

and then, according to the documentation:

Create a key named --conf for your AWS Glue job, and set it to the following value. Alternatively, you can set the following configuration using SparkConf in your script. These settings help Apache Spark correctly handle Delta Lake tables.

spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore
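As far as I understand, setting the same values via SparkConf in the script would look roughly like this (a sketch, not the code I actually ran):

from pyspark import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# The same three settings as the --conf value above, set through SparkConf
conf = SparkConf()
conf.set("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
conf.set("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
conf.set("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")

sc = SparkContext.getOrCreate(conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session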
import pyspark
from delta.tables import *
from delta import *
from pyspark.sql.types import *
from pyspark.sql.functions import *

#  Create a spark session with Delta
builder = pyspark.sql.SparkSession.builder.appName("DeltaTutorial") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

# Create spark context
spark = configure_spark_with_delta_pip(builder).getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

And then I got the error:

ModuleNotFoundError: No module named 'delta'

or, without the import part:

#  Create a spark session with Delta
builder = pyspark.sql.SparkSession.builder.appName("DeltaTutorial") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

# Create spark context
spark = configure_spark_with_delta_pip(builder).getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

and then this error:

NameError: name 'configure_spark_with_delta_pip' is not defined

Upvotes: 1

Views: 1115

Answers (2)

tallwithknees

Reputation: 311

Quite late to this, but for anyone else struggling: if you are running this from a normal Python script, then in the section where you configure the Spark session with configure_spark_with_delta_pip you also need to add the Maven reference via extra_packages, like below:


from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder
        .config("spark.jars.packages", "io.delta:delta-core_2.12:2.2.0")  # <<<-- here is the change
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)

spark = configure_spark_with_delta_pip(
        spark_session_builder=builder,
        # below is the change: pass the Delta Maven coordinates plus any other packages you need
        extra_packages=["io.delta:delta-core_2.12:2.2.0",
                        "<another_package>"]
    ).getOrCreate()
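Once that session is up, the write from the question should work, for example (the path is a placeholder):

df = spark.range(5)   # any DataFrame produced by your ETL
df.write.format("delta").mode("overwrite").save("s3://your-bucket/delta/demo")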

Upvotes: 0

Rahul

Reputation: 44

You need to install an additional Python module, "delta-spark", to use the Delta API. Use the parameter:

Key: additional_python_modules 
Value: delta-spark

I have tested this by launching a Glue notebook.

I added my first cell as below, with the module to install:

%idle_timeout 2880
%glue_version 4.0
%worker_type G.1X
%number_of_workers 2
%additional_python_modules delta-spark
%%configure
{
"conf":"spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog",
"datalake-formats":"delta"
}
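After that, a later cell can run the ETL and the Delta write directly. A minimal sketch (it assumes spark is the session from the usual Glue boilerplate, and the S3 path is a placeholder):

# Minimal ETL ending in a Delta write (placeholder path)
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
df.write.format("delta").mode("overwrite").save("s3://your-bucket/delta/demo")

# delta-spark also makes the Python Delta API importable
from delta.tables import DeltaTable
DeltaTable.forPath(spark, "s3://your-bucket/delta/demo").toDF().show()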

Upvotes: 1
