Vishal

Reputation: 95

Not able to save large Spark dataframe as pickle

I have a large dataframe (a little more than 20 GB) and am trying to save it as a pickle object to be used later in another process.

I have tried different configurations; below is the latest one.

executor_cores=4                                                                
executor_memory='20g'                                                           
driver_memory='40g'                                                                
deploy_mode='client'                                                            
max_executors_dynamic='spark.dynamicAllocation.maxExecutors=400'                   
num_executors_static=300                                                        
spark_driver_memoryOverhead='5g'                                                 
spark_executor_memoryOverhead='2g'                                               
spark_driver_maxResultSize='8g'                                                 
spark_kryoserializer_buffer_max='1g'                                            
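For reference, this is roughly how the settings above map onto Spark configuration keys when the session is built (variable names reused from above; driver memory and deploy mode are normally passed to spark-submit in client mode, since the driver JVM already exists by the time application code runs):

from pyspark.sql import SparkSession

# Rough mapping of the settings above onto Spark config keys
spark = (
    SparkSession.builder
    .config('spark.executor.cores', executor_cores)
    .config('spark.executor.memory', executor_memory)
    .config('spark.executor.instances', num_executors_static)
    .config('spark.dynamicAllocation.maxExecutors', 400)
    .config('spark.driver.memoryOverhead', spark_driver_memoryOverhead)
    .config('spark.executor.memoryOverhead', spark_executor_memoryOverhead)
    .config('spark.driver.maxResultSize', spark_driver_maxResultSize)
    .config('spark.kryoserializer.buffer.max', spark_kryoserializer_buffer_max)
    .getOrCreate()
)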

Note: I cannot increase spark_driver_maxResultSize beyond 8 GB.

I have also tried saving the dataframe as HDFS files (Parquet) and then converting that to pickle, but I get the same error message as before.

My understanding is that toPandas() brings all of the data onto the driver before the pickle is written. Since the data size exceeds spark_driver_maxResultSize, the code fails. (The same code worked earlier for 2 GB of data.)

Is there any workaround to solve this problem?

# Attempt 1: collect to pandas on the driver and pickle directly
big_data_frame.toPandas().to_pickle('{}/result_file_01.pickle'.format(result_dir))

# Attempt 2: write to HDFS as Parquet first
big_data_frame.write.save('{}/result_file_01.pickle'.format(result_dir), format='parquet', mode='append')

# ...then read the Parquet back and convert to pickle (fails with the same error)
df_to_pickel = sqlContext.read.format('parquet').load(file_path)
df_to_pickel.toPandas().to_pickle('{}/scoring__{}.pickle'.format(afs_dir, rd.strftime('%Y%m%d')))

Error message

Py4JJavaError: An error occurred while calling o1638.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 955 tasks (4.0 GB) is bigger than spark.driver.maxResultSize (4.0 GB)

Upvotes: 0

Views: 3526

Answers (1)

secretive

Reputation: 2104

Saving as a pickle file is an RDD function in Spark, not a DataFrame one. To save your frame as pickle, run

big_data_frame.rdd.saveAsPickleFile(filename)

If you are working with big data, it is never a good idea to call either collect or toPandas in Spark, as they pull everything into driver memory and crash the job. I would suggest using Parquet or another file format for saving your data, since the RDD API is in maintenance mode, meaning Spark is not actively adding new features to it.
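For example, a minimal sketch of the Parquet route, reusing big_data_frame, result_dir and spark from the question (the output path is just an example):

# Writing Parquet stays distributed, so nothing is funneled through the
# driver or limited by spark.driver.maxResultSize.
parquet_path = '{}/result_file_01.parquet'.format(result_dir)
big_data_frame.write.mode('overwrite').parquet(parquet_path)

# The downstream process reads it back as a DataFrame the same way.
df = spark.read.parquet(parquet_path)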

To read the pickle file back, try

# createDataFrame accepts the RDD directly, so there is no need to collect it to the driver
pickle_rdd = sc.pickleFile(filename)
df = spark.createDataFrame(pickle_rdd)

Upvotes: 2
