user2805507

Reputation:

Spark dataframe to pandas profiling

I am trying to do data profiling with the pandas-profiling library. I am fetching data directly from Hive. This is the error I am receiving:

Py4JJavaError: An error occurred while calling o114.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 14.0 failed 4 times, most recent failure: Lost task 2.3 in stage 14.0 (TID 65, bdgtr026x30h4.nam.nsroot.net, executor 11): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 15823824. To avoid this, increase spark.kryoserializer.buffer.max value.

I tried to set the Spark configuration from my Jupyter notebook in Python, but I am receiving the same error:

spark.conf.set("spark.kryoserializer.buffer.max", "512")
spark.conf.set('spark.kryoserializer.buffer.max.mb', 'val')

Based on my code, am I missing any steps?

from pandas_profiling import ProfileReport

df = spark.sql('SELECT id,acct from tablename').cache()
report = ProfileReport(df.toPandas())

Upvotes: 1

Views: 3266

Answers (2)

Climbs_lika_Spyder

Reputation: 6714

It looks like native Spark support is coming! See the GitHub thread. TL;DR: an alpha release is expected in January 2022.

Upvotes: 0

Shubham Jain

Reputation: 5526

Instead of setting the configuration in Jupyter after the fact, set it while creating the Spark session, because once the session is created these configuration values can no longer be changed.

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.kryoserializer.buffer.max", "512m") \
    .config("spark.kryoserializer.buffer", "512k") \
    .getOrCreate()
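Putting the configured session together with the profiling step from the question, a minimal sketch might look like the following (the table and column names are taken from the question; `enableHiveSupport()` is assumed to be needed since the data comes from Hive, and the `pandas_profiling` import path applies to the pre-v3 releases of the library):

```python
from pyspark.sql import SparkSession
from pandas_profiling import ProfileReport

# Configure the Kryo buffers up front, at session-creation time;
# calling spark.conf.set() later would be too late for these settings.
spark = (
    SparkSession.builder
    .appName("myApp")
    .config("spark.kryoserializer.buffer.max", "512m")
    .config("spark.kryoserializer.buffer", "512k")
    .enableHiveSupport()  # required to query Hive tables
    .getOrCreate()
)

# Select only the columns you actually need before collecting to the
# driver: toPandas() pulls the full result into driver memory.
df = spark.sql("SELECT id, acct FROM tablename")
report = ProfileReport(df.toPandas())
report.to_file("profile.html")
```

Note that `toPandas()` materializes the entire result set on the driver, so even with a larger Kryo buffer it only works for data that fits in driver memory; sampling or limiting the query is a common workaround for large tables.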

You can find the details of these properties here.

Upvotes: 1
