user07

Reputation: 670

How can I improve performance when converting a huge pandas DataFrame (40-50 million rows) to a Spark 2.0 DataFrame?

I am trying to convert my pandas DataFrame to a Spark 2.0 DataFrame using the code below:

spark_df = sqlContext.createDataFrame(pandas_df)

I have a couple of questions:

  1. I want to understand what happens internally when we convert a pandas DataFrame to a Spark DataFrame. I already understand what happens in the other direction, i.e. when converting Spark to pandas with the toPandas() method the whole dataset is collected to the driver, etc.
  2. I am converting pandas to Spark, but it is taking too much time, seemingly more than 10-12 hours. One reason I can think of is that the pandas DataFrame has approximately 43 million rows. Is there any way I can get a performance gain? Would it help if I provided a schema explicitly (something like the sketch below)? Any other suggestions?
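
For reference, this is roughly what I mean by providing a schema explicitly. It is only a sketch; the column names and types are placeholders for my actual columns:

    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    # Placeholder schema; replace with the real column names and types
    schema = StructType([
        StructField("id", StringType(), True),
        StructField("value", DoubleType(), True),
    ])

    # Passing the schema skips the type-inference pass over the pandas data
    spark_df = sqlContext.createDataFrame(pandas_df, schema=schema)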

Upvotes: 2

Views: 947

Answers (1)

Dat Tran

Reputation: 2392

Why are you creating a Spark DataFrame from a pandas DataFrame of this size? It doesn't make much sense: it's a huge overhead, since you're loading your data into memory through the pandas DataFrame and then again into Spark. I'm not sure what your settings are (memory, cluster size, etc.), but if you are on your local machine this can eat up your memory.

My suggestion: since a pandas DataFrame has a relational format, I guess you're creating it from CSV files (or something similar, like TSV). The better solution would be to load the data directly into a Spark DataFrame through the DataFrameReader. You can also pass the schema; then loading will be even faster.
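
As a rough sketch of what that could look like (the file path, column names, and types here are placeholders you would replace with your own):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("load-csv").getOrCreate()

    # Placeholder schema; replace with the actual columns of your file
    schema = StructType([
        StructField("id", StringType(), True),
        StructField("value", DoubleType(), True),
    ])

    # Read the CSV straight into a Spark DataFrame, bypassing pandas entirely
    spark_df = (spark.read
                .schema(schema)            # providing the schema avoids an extra inference pass
                .option("header", "true")
                .csv("/path/to/your/file.csv"))

This way the data never has to sit in driver memory as a pandas object first.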

Upvotes: 1
