jacobe

SparkSession.newSession with distinct SQLConf

tl;dr: How do I use SparkSession.newSession with changes to the SQL config?

I'm using PySpark within AWS Glue, creating a Glue 5 notebook.

I'd like to have two different SparkSessions with different SQL configs (two different warehouses). Everything is Iceberg.

I can easily set up a session that works fine doing something like this:

warehouse_path = "s3://some_s3_bucket/path"
spark = SparkSession.builder \
    .config("spark.sql.warehouse.dir", warehouse_path) \
    .config("spark.sql.catalog.glue_catalog.warehouse", warehouse_path) \
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.parquet.compression.codec", "gzip") \
    .getOrCreate()

So, to have two different sessions, with different warehouse paths, I attempt to do something like this:

spark = SparkSession.builder \
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.parquet.compression.codec", "gzip") \
    .getOrCreate()

warehouse_path_1 = "s3://s3_bucket_1/path"
spark_session_1 = spark.newSession()
spark_session_1.conf.set("spark.sql.warehouse.dir", warehouse_path_1)
spark_session_1.conf.set("spark.sql.catalog.glue_catalog.warehouse", warehouse_path_1)

warehouse_path_2 = "s3://s3_bucket_2/path"
spark_session_2 = spark.newSession()
spark_session_2.conf.set("spark.sql.warehouse.dir", warehouse_path_2)
spark_session_2.conf.set("spark.sql.catalog.glue_catalog.warehouse", warehouse_path_2)

(I also tried setting all of the SQL confs on the child sessions, not just the changed ones, with the same results.)

I end up with this error (or a similar one for whichever SQL conf I try to change first):

AnalysisException: Cannot modify the value of a static config: spark.sql.warehouse.dir

On the one hand, I understand that the SQLConf is static, but the docs for newSession say (emphasis mine):

Returns a new SparkSession as new session, that has separate SQLConf, registered temporary views and UDFs, but shared SparkContext and table cache.

So, if it has a "separate SQLConf", how can I actually set it up with different SQL options?
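For context, the only workaround I've sketched so far (untested) is to avoid per-session configs entirely and instead register two differently named Iceberg catalogs in a single session, each with its own warehouse, then address tables as catalog.db.table. The catalog names `wh1`/`wh2` and the helper function below are placeholders of my own, not anything from the Spark or Iceberg APIs:

```python
# Untested sketch: two named Iceberg catalogs in ONE SparkSession,
# each pointing at its own warehouse, instead of two sessions.
# "wh1"/"wh2" and iceberg_glue_catalog_conf are placeholder names.

def iceberg_glue_catalog_conf(catalog_name: str, warehouse_path: str) -> dict:
    """Build the spark.sql.catalog.* entries for one named Iceberg catalog."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse_path,
    }

conf = {
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.parquet.compression.codec": "gzip",
}
conf.update(iceberg_glue_catalog_conf("wh1", "s3://s3_bucket_1/path"))
conf.update(iceberg_glue_catalog_conf("wh2", "s3://s3_bucket_2/path"))

# Applied at session creation time (all of these are settable via the builder):
# builder = SparkSession.builder
# for k, v in conf.items():
#     builder = builder.config(k, v)
# spark = builder.getOrCreate()
# spark.sql("SELECT * FROM wh1.some_db.some_table")  # warehouse 1
# spark.sql("SELECT * FROM wh2.some_db.some_table")  # warehouse 2
```

This sidesteps the static-conf problem because every config is set once on the builder, before the session exists, but it isn't the separate-SQLConf behavior the newSession docs seem to promise.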

Upvotes: 0

Answers (0)