Reputation: 670
I am trying to convert my pandas dataframe to a Spark 2.0 dataframe using the code below:
spark_df= sqlContext.createDataFrame(pandas_df)
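For context, here is a minimal self-contained version of what I'm running (the sample data and app name are placeholders for my real setup; in Spark 2.0 the SparkSession exposes the same createDataFrame API):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas-to-spark").getOrCreate()

# Placeholder pandas DataFrame standing in for my real data
pandas_df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Convert the in-memory pandas DataFrame into a Spark DataFrame
spark_df = spark.createDataFrame(pandas_df)
spark_df.show()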
I have a couple of questions:
Upvotes: 2
Views: 947
Reputation: 2392
Why are you creating a Spark DF from a pandas DF of this size? It doesn't make sense: it's a huge overhead, since you're loading your data into memory through a pandas DF and then again in Spark. I'm not sure what your settings are like (memory, cluster size, etc.), but if you're on your local machine this can eat up your memory.
My suggestion: since a pandas DF has a relational format, I'd guess you're creating your DataFrame from CSV files (or something similar such as TSV). The better solution would be to load it directly into a Spark DataFrame through the DataFrameReader. You can also pass the schema, and then loading will be even faster.
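For example, here is a minimal sketch of loading a CSV straight through the DataFrameReader with an explicit schema (the file path and column names are placeholders — adjust them to your data):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("csv-to-spark").getOrCreate()

# Declaring the schema up front lets Spark skip the inference pass over the file
schema = StructType([
    StructField("id", IntegerType(), True),    # placeholder column
    StructField("name", StringType(), True),   # placeholder column
])

# spark.read is the DataFrameReader; no intermediate pandas DF held in memory
spark_df = spark.read.csv("data.csv", header=True, schema=schema)
spark_df.show()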
Upvotes: 1