Vishal

Reputation: 95

Not able to save large Spark dataframe as pickle

I have a large dataframe (a little more than 20 GB) and am trying to save it as a pickle object to be used later in another process.

I have tried different configurations; below is the latest one.

executor_cores=4                                                                
executor_memory='20g'                                                           
driver_memory='40g'                                                                
deploy_mode='client'                                                            
max_executors_dynamic='spark.dynamicAllocation.maxExecutors=400'                   
num_executors_static=300                                                        
spark_driver_memoryOverhead='5g'                                                 
spark_executor_memoryOverhead='2g'                                               
spark_driver_maxResultSize='8g'                                                 
spark_kryoserializer_buffer_max='1g'                                            
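For reference, this is roughly how the settings above map onto Spark configuration keys when the session is built (variable names reused from above; driver memory and deploy mode are normally passed to spark-submit in client mode, since the driver JVM already exists by the time application code runs):

from pyspark.sql import SparkSession

# Rough mapping of the settings above onto Spark config keys
spark = (
    SparkSession.builder
    .config('spark.executor.cores', executor_cores)
    .config('spark.executor.memory', executor_memory)
    .config('spark.executor.instances', num_executors_static)
    .config('spark.dynamicAllocation.maxExecutors', 400)
    .config('spark.driver.memoryOverhead', spark_driver_memoryOverhead)
    .config('spark.executor.memoryOverhead', spark_executor_memoryOverhead)
    .config('spark.driver.maxResultSize', spark_driver_maxResultSize)
    .config('spark.kryoserializer.buffer.max', spark_kryoserializer_buffer_max)
    .getOrCreate()
)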

Note: I cannot increase spark_driver_maxResultSize beyond 8 GB.

I have also tried saving the dataframe as HDFS files (Parquet) and then converting that to pickle, but I get the same error message as before.

My understanding is that toPandas() brings all of the data onto the driver before the pickle is written. Since the data size exceeds spark_driver_maxResultSize, the code fails. (The same code worked earlier for 2 GB of data.)

Is there any workaround to solve this problem?

# Attempt 1: collect to pandas on the driver and pickle directly
big_data_frame.toPandas().to_pickle('{}/result_file_01.pickle'.format(result_dir))

# Attempt 2: write to HDFS as Parquet first
big_data_frame.write.save('{}/result_file_01.pickle'.format(result_dir), format='parquet', mode='append')

# ...then read the Parquet back and convert to pickle (fails with the same error)
df_to_pickel = sqlContext.read.format('parquet').load(file_path)
df_to_pickel.toPandas().to_pickle('{}/scoring__{}.pickle'.format(afs_dir, rd.strftime('%Y%m%d')))

Error message

Py4JJavaError: An error occurred while calling o1638.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 955 tasks (4.0 GB) is bigger than spark.driver.maxResultSize (4.0 GB)

Upvotes: 0

Views: 3526

Answers (1)

secretive

Reputation: 2104

Saving as a pickle file is an RDD function in Spark, not a DataFrame one. To save your frame as pickle, run

big_data_frame.rdd.saveAsPickleFile(filename)

If you are working with big data, it is never a good idea to call either collect or toPandas in Spark, as they pull everything into driver memory and crash the job. I would suggest using Parquet or another file format for saving your data, since the RDD API is in maintenance mode, meaning Spark is not actively adding new features to it.
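For example, a minimal sketch of the Parquet route, reusing big_data_frame, result_dir and spark from the question (the output path is just an example):

# Writing Parquet stays distributed, so nothing is funneled through the
# driver or limited by spark.driver.maxResultSize.
parquet_path = '{}/result_file_01.parquet'.format(result_dir)
big_data_frame.write.mode('overwrite').parquet(parquet_path)

# The downstream process reads it back as a DataFrame the same way.
df = spark.read.parquet(parquet_path)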

To read the pickle file back, try

# createDataFrame accepts the RDD directly, so there is no need to collect it to the driver
pickle_rdd = sc.pickleFile(filename)
df = spark.createDataFrame(pickle_rdd)

Upvotes: 2
