Vidya821

Reputation: 77

Read/Write Delta Lake tables on S3 using AWS Glue jobs

I am trying to access Delta Lake tables stored on S3 from AWS Glue jobs, but I am getting the error "Module Delta not defined".

from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

spark = SparkSession.builder.appName("MyApp") \
    .config("spark.jars.packages", "io.delta:delta-core_2.11:0.6.0") \
    .getOrCreate()

from delta.tables import *

data = spark.range(0, 5)
data.write.format("delta").save("S3://databricksblaze/data")

I also added the necessary JAR (delta-core_2.11-0.6.0.jar) to the dependent JARs of the Glue job. Can anyone help me with this? Thanks.

Upvotes: 1

Views: 7149

Answers (3)

EzuA

Reputation: 56

I have had success using Glue + Delta Lake. I added the Delta Lake dependencies to the "Dependent jars path" section of the Glue job. Here is the list of them (I am using Delta Lake 0.6.1):

  • com.ibm.icu_icu4j-58.2.jar
  • io.delta_delta-core_2.11-0.6.1.jar
  • org.abego.treelayout_org.abego.treelayout.core-1.0.3.jar
  • org.antlr_antlr4-4.7.jar
  • org.antlr_antlr4-runtime-4.7.jar
  • org.antlr_antlr-runtime-3.5.2.jar
  • org.antlr_ST4-4.0.8.jar
  • org.glassfish_javax.json-1.0.4.jar

Then in your Glue job you can use the following code:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
# Make the Delta Lake Python module (bundled inside the JAR) importable
sc.addPyFile("io.delta_delta-core_2.11-0.6.1.jar")

from delta.tables import *

glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Write a small Delta table to S3 and get a DeltaTable handle for it
delta_path = "s3a://your_bucket/folder"
data = spark.range(0, 5)
data.write.format("delta").mode("overwrite").save(delta_path)

deltaTable = DeltaTable.forPath(spark, delta_path)
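
As a quick sanity check (not in the original answer) you can read the table back with the same session; this sketch assumes the delta_path defined above:

# Read the Delta table back as a regular DataFrame to verify the write
df = spark.read.format("delta").load(delta_path)
df.show()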

Upvotes: 2

zsxwing

Reputation: 20826

Setting spark.jars.packages in SparkSession.builder.config doesn't work. spark.jars.packages is handled by org.apache.spark.deploy.SparkSubmitArguments/SparkSubmit, so it must be passed as an argument of the spark-submit or pyspark script. By the time SparkSession.builder.config is called, SparkSubmit has already done its job, so setting spark.jars.packages there is a no-op. See https://issues.apache.org/jira/browse/SPARK-21752 for more details.
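
To illustrate the point (this sketch is not from the original answer): when the pyspark launcher starts the JVM itself, the package has to go through the submit arguments, for example via the PYSPARK_SUBMIT_ARGS environment variable set before any session is created.

import os

# Sketch only: let spark-submit resolve the Delta package when the JVM is
# launched. Must be set before any SparkContext/SparkSession exists, and
# the value has to end with "pyspark-shell".
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages io.delta:delta-core_2.11:0.6.0 pyspark-shell"
)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp").getOrCreate()

Inside a Glue job the runtime is already configured, so the practical route remains the one in the answer above: add the Delta JAR to the job's dependent JARs path.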

Upvotes: 1

Shubham Jain

Reputation: 5526

You need to pass the additional configuration properties:

--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"

Upvotes: 1
