user07

Reputation: 670

How can I improve performance when converting a huge pandas DataFrame (40-50 million rows) to a Spark 2.0 DataFrame?

I am trying to convert my pandas DataFrame to a Spark 2.0 DataFrame using the code below:

spark_df = sqlContext.createDataFrame(pandas_df)

I have a couple of questions:

  1. I want to understand what happens internally when we convert a pandas DataFrame to a Spark DataFrame. I already understand what happens in the other direction, i.e. when converting Spark to pandas with the toPandas() method the whole dataset is collected to the driver, etc.
  2. I am converting pandas to Spark, but it is taking too much time, seemingly more than 10-12 hours. One reason I can think of is that the pandas DataFrame has approximately 43 million rows. Is there any way I can get a performance gain? Would it help if I provided a schema explicitly (something like the sketch below)? Any other suggestions?
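
For reference, this is roughly what I mean by providing a schema explicitly. It is only a sketch; the column names and types are placeholders for my actual columns:

    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    # Placeholder schema; replace with the real column names and types
    schema = StructType([
        StructField("id", StringType(), True),
        StructField("value", DoubleType(), True),
    ])

    # Passing the schema skips the type-inference pass over the pandas data
    spark_df = sqlContext.createDataFrame(pandas_df, schema=schema)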

Upvotes: 2

Views: 947

Answers (1)

Dat Tran

Reputation: 2392

Why are you creating a Spark DataFrame from a pandas DataFrame of this size? It doesn't make much sense: it's a huge overhead, since you're loading your data into memory through the pandas DataFrame and then again into Spark. I'm not sure what your settings are (memory, cluster size, etc.), but if you are on your local machine this can eat up your memory.

My suggestion: since a pandas DataFrame has a relational format, I guess you're creating it from CSV files (or something similar, like TSV). The better solution would be to load the data directly into a Spark DataFrame through the DataFrameReader. You can also pass the schema; then loading will be even faster.
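
As a rough sketch of what that could look like (the file path, column names, and types here are placeholders you would replace with your own):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("load-csv").getOrCreate()

    # Placeholder schema; replace with the actual columns of your file
    schema = StructType([
        StructField("id", StringType(), True),
        StructField("value", DoubleType(), True),
    ])

    # Read the CSV straight into a Spark DataFrame, bypassing pandas entirely
    spark_df = (spark.read
                .schema(schema)            # providing the schema avoids an extra inference pass
                .option("header", "true")
                .csv("/path/to/your/file.csv"))

This way the data never has to sit in driver memory as a pandas object first.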

Upvotes: 1
