Reputation: 1557
I have a pandas dataframe data_pandas with about half a million rows and 30,000 columns. I want to convert it to a Spark dataframe data_spark, which I try to do with:
data_spark = sqlContext.createDataFrame(data_pandas)
I am working on an r3.8xlarge driver with 10 workers of the same configuration, but the operation above takes forever and fails with an OOM error. Is there an alternative method I can try?
The source data is in HDF format, so I can't read it directly as a Spark dataframe.
Upvotes: 2
Views: 3551
Reputation: 91
You can try enabling Apache Arrow, which makes the conversion between pandas and Spark dataframes more efficient:
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
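For example, a minimal sketch assuming Spark 2.3/2.4 with pyarrow installed and a SparkSession named spark (in Spark 3.x the key is spark.sql.execution.arrow.pyspark.enabled; the setting only speeds up the conversion path, the pandas dataframe still has to fit in driver memory):
# enable Arrow-based conversion between pandas and Spark (requires pyarrow)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
# same createDataFrame call as before, but it now goes through Arrow
data_spark = spark.createDataFrame(data_pandas)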
For more details, see: https://bryancutler.github.io/toPandas/
Upvotes: 1
Reputation: 1336
One option is to convert the pandas dataframe in batches rather than in one go. The code below splits it into 20 chunks, converts each chunk to a Spark dataframe, and unions the results (parts of the solution are taken from the questions here and here):
import numpy as np

def unionAll(*dfs):
    ' by @zero323 from here: http://stackoverflow.com/a/33744540/42346 '
    first, *rest = dfs  # Python 3.x, for 2.x you'll have to unpack manually
    return first.sql_ctx.createDataFrame(
        first.sql_ctx._sc.union([df.rdd for df in dfs]),
        first.schema
    )

# split the pandas dataframe into 20 chunks and convert each one separately
df_list = []
for chunk in np.array_split(data_pandas, 20):
    df_list.append(sqlContext.createDataFrame(chunk))

# union all the chunk dataframes back into a single Spark dataframe
df_all = unionAll(*df_list)
Upvotes: 0