Reputation: 63
I want to reset the spark.sql.shuffle.partitions configuration in my PySpark code, since I need to join two big tables. But the following code does not work in the latest Spark version; the error says there is "no method "setConf" in xxx".
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
import pyspark
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext('local')
spark = SparkSession(sc)
spark.sparkContext.setConf("spark.sql.shuffle.partitions", "1000")
spark.sparkContext.setConf("spark.default.parallelism", "1000")
# or using the following; neither is working
spark.setConf("spark.sql.shuffle.partitions", "1000")
spark.setConf("spark.default.parallelism", "1000")
I would like to know how to set "spark.sql.shuffle.partitions" now.
Upvotes: 4
Views: 24887
Reputation: 69
Please be aware that we discovered a defect in the Spark SQL "Group By" / "Distinct" implementation when spark.sql.shuffle.partitions is set to a value greater than 2000. We tested with a dataset of around 3000 records with 38 columns, of which about 1800 records were unique.
When we ran the "Distinct" or "Group By" query over the 38 columns with "spark.sql.shuffle.partitions" set to 2001, the count of distinct records came back as less than 1800, e.g. 1794. However, when we set it to 2000, the same query returned a count of 1800.
So basically, Spark incorrectly drops a few records when the number of shuffle partitions is greater than 2000.
We tested with Spark v2.3.1 and will file a bug in Jira soon. I still need to prepare test data to demonstrate the issue, but we have already confirmed it with our real-world dataset.
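For anyone who wants to check this against their own data, a minimal sketch along these lines compares the distinct count at 2000 versus 2001 shuffle partitions. The input path, app name, and DataFrame are placeholders, not the dataset from the report above:
from pyspark.sql import SparkSession

# Sketch of the comparison described above; the parquet path is hypothetical.
spark = SparkSession.builder.master("local[*]").appName("shuffle-partition-check").getOrCreate()
df = spark.read.parquet("/path/to/your/dataset")  # placeholder: ~3000 rows, 38 columns

for partitions in ("2000", "2001"):
    # spark.sql.shuffle.partitions can be changed between queries at runtime
    spark.conf.set("spark.sql.shuffle.partitions", partitions)
    distinct_count = df.distinct().count()  # distinct over all columns
    print("spark.sql.shuffle.partitions=%s -> distinct count=%d" % (partitions, distinct_count))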
Upvotes: 2
Reputation: 960
SparkSession provides a RuntimeConfig interface to set and get Spark-related parameters. The answer to your question would be:
spark.conf.set("spark.sql.shuffle.partitions", 1000)
Refer: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.RuntimeConfig
I missed that your question was about PySpark. PySpark has a similar interface, spark.conf.
Refer: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession.conf
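As a concrete PySpark sketch (the app name and master are illustrative, not required values):
from pyspark.sql import SparkSession

# Build (or reuse) the session, then adjust the SQL runtime configuration.
spark = SparkSession.builder.master("local[*]").appName("tune-shuffle").getOrCreate()

spark.conf.set("spark.sql.shuffle.partitions", "1000")
print(spark.conf.get("spark.sql.shuffle.partitions"))  # -> 1000

# spark.default.parallelism is a core (non-SQL) setting, so it generally needs to be
# supplied before the SparkContext starts, e.g. via the builder:
#   SparkSession.builder.config("spark.default.parallelism", "1000").getOrCreate()
Note that spark.sql.shuffle.partitions takes effect for subsequent queries in the same session, while setting spark.default.parallelism through spark.conf may not take effect once the context is already running.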
Upvotes: 9