I am trying to do data profiling with the pandas-profiling library. I am fetching the data directly from Hive. This is the error I am receiving:
Py4JJavaError: An error occurred while calling o114.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 14.0 failed 4 times, most recent failure: Lost task 2.3 in stage 14.0 (TID 65, bdgtr026x30h4.nam.nsroot.net, executor 11): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 15823824. To avoid this, increase spark.kryoserializer.buffer.max value.
I tried to set the Spark configuration from my Jupyter notebook in Python, but I am still receiving the same error:
spark.conf.set("spark.kryoserializer.buffer.max", "512")
spark.conf.set('spark.kryoserializer.buffer.max.mb', 'val')
Based on my code, am I missing any steps?
from pandas_profiling import ProfileReport

df = spark.sql('SELECT id,acct from tablename').cache()
report = ProfileReport(df.toPandas())
Upvotes: 1
Views: 3266
Reputation: 6714
It looks like Spark support is coming! GitHub thread TL;DR: an alpha release is expected in January 2022!
Upvotes: 0
Reputation: 5526
Instead of setting the configuration in Jupyter, set it while creating the Spark session, because once the session is created the configuration cannot be changed.
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("myApp") \
.config("spark.kryoserializer.buffer.max", "512m") \
.config('spark.kryoserializer.buffer', '512k') \
.getOrCreate()
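In a Jupyter notebook a SparkSession is usually already running, and getOrCreate() will reuse it, so session-level settings such as the Kryo buffer size will not take effect until the existing session is stopped. A minimal sketch, assuming the spark variable from the question already points at the running session:

from pyspark.sql import SparkSession

# Stop the session that is already running in the notebook, otherwise
# getOrCreate() reuses it and the old Kryo settings stay in effect.
spark.stop()

spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.kryoserializer.buffer.max", "512m") \
    .getOrCreate()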
You can find details on these properties in the Spark configuration documentation.
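Once the session is created with the larger buffer, you can confirm the value took effect and rerun the profiling step. A minimal sketch, reusing the table and column names from the question (the report title and output file name are just illustrative):

from pandas_profiling import ProfileReport

# Confirm the setting is active on the current session
print(spark.conf.get("spark.kryoserializer.buffer.max"))  # expect "512m"

# Re-run the query from the question and build the profile
df = spark.sql('SELECT id,acct from tablename')
report = ProfileReport(df.toPandas(), title="Hive table profile")  # title is illustrative
report.to_file("profile.html")  # file name is illustrative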
Upvotes: 1