Reputation: 401
tl;dr: How do I use SparkSession.newSession with changes to the SQL config?
I'm using PySpark within AWS Glue, creating a Glue 5 notebook.
I'd like to have two different SparkSessions, with different SQL configs (two different warehouses). Everything is Iceberg.
I can easily set up a session that works fine doing something like this:
from pyspark.sql import SparkSession

warehouse_path = "s3://some_s3_bucket/path"
spark = SparkSession.builder \
    .config("spark.sql.warehouse.dir", warehouse_path) \
    .config("spark.sql.catalog.glue_catalog.warehouse", warehouse_path) \
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.parquet.compression.codec", "gzip") \
    .getOrCreate()
So, to have two different sessions, with different warehouse paths, I attempt to do something like this:
spark = SparkSession.builder \
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.parquet.compression.codec", "gzip") \
    .getOrCreate()
warehouse_path_1 = "s3://s3_bucket_1/path"
spark_session_1 = spark.newSession()
spark_session_1.conf.set("spark.sql.warehouse.dir", warehouse_path_1)
spark_session_1.conf.set("spark.sql.catalog.glue_catalog.warehouse", warehouse_path_1)
warehouse_path_2 = "s3://s3_bucket_2/path"
spark_session_2 = spark.newSession()
spark_session_2.conf.set("spark.sql.warehouse.dir", warehouse_path_2)
spark_session_2.conf.set("spark.sql.catalog.glue_catalog.warehouse", warehouse_path_2)
(I also tried setting all of the SQL confs on the child sessions, not just the changed ones, with the same results.)
I end up with this error (or a similar one for whichever SQL conf I try to change first):
AnalysisException: Cannot modify the value of a static config: spark.sql.warehouse.dir
On the one hand, I understand that the SQLConf is static, but the docs for newSession say (emphasis mine):
Returns a new SparkSession as new session, that has separate SQLConf, registered temporary views and UDFs, but shared SparkContext and table cache.
So, if it has a "separate SQLConf", how can I actually set it up with different SQL options?
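For what it's worth, the only workaround I can think of is to register two differently-named catalogs in a single session, so each warehouse is reached through its own catalog name rather than through a per-session conf. A sketch of that idea is below; it's untested, the catalog names glue_catalog_1/glue_catalog_2 are made up, and I'd still prefer a newSession-based approach if one exists:

```python
# Untested sketch: one SparkSession, two Iceberg catalogs, each pointed at a
# different warehouse. The catalog names (glue_catalog_1, glue_catalog_2) are
# arbitrary; tables are addressed as <catalog>.<db>.<table>.
from pyspark.sql import SparkSession

warehouse_path_1 = "s3://s3_bucket_1/path"
warehouse_path_2 = "s3://s3_bucket_2/path"

spark = SparkSession.builder \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.glue_catalog_1", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.glue_catalog_1.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config("spark.sql.catalog.glue_catalog_1.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .config("spark.sql.catalog.glue_catalog_1.warehouse", warehouse_path_1) \
    .config("spark.sql.catalog.glue_catalog_2", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.glue_catalog_2.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config("spark.sql.catalog.glue_catalog_2.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .config("spark.sql.catalog.glue_catalog_2.warehouse", warehouse_path_2) \
    .getOrCreate()

# Then each warehouse is selected by catalog name, e.g.:
# spark.sql("SELECT * FROM glue_catalog_1.some_db.some_table")
# spark.sql("SELECT * FROM glue_catalog_2.some_db.some_table")
```

But that means rewriting every table reference, which is why I'm asking whether newSession can do this directly.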
Upvotes: 0
Views: 27