j1897

Reputation: 1557

Is there a more efficient method to convert pandas dataframe to Spark dataframe?

I have a pandas dataframe data_pandas with about half a million rows and 30,000 columns. I want to convert it to a Spark dataframe data_spark, which I currently do with:

data_spark = sqlContext.createDataFrame(data_pandas)

I am working on an r3.8xlarge driver with 10 workers of the same configuration, but the operation above takes forever and ends with an OOM error. Is there an alternative method I can try?

The source data is in HDF format, so I can't read it directly as a Spark dataframe.

Upvotes: 2

Views: 3551

Answers (2)

Snail Pacer

Reputation: 91

You can try enabling Apache Arrow, which can make the conversion more efficient:

spark.conf.set("spark.sql.execution.arrow.enabled", "true")

For more details, see: https://bryancutler.github.io/toPandas/
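A minimal sketch of how this is typically used, assuming Spark 2.3+ with pyarrow installed on the driver (the small dataframe below is just a stand-in for the question's data_pandas):

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas_to_spark").getOrCreate()

# Enable Arrow-based columnar data transfer for createDataFrame / toPandas.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Stand-in for the question's data_pandas (column names must be strings).
data_pandas = pd.DataFrame(np.random.rand(1000, 50),
                           columns=["c" + str(i) for i in range(50)])

# With Arrow enabled, the pandas data is transferred in columnar batches
# instead of being pickled row by row, which is usually much faster.
data_spark = spark.createDataFrame(data_pandas)
data_spark.printSchema()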

Upvotes: 1

Gaurav Dhama

Reputation: 1336

One option is to convert the pandas dataframe in batches rather than in one go. The code below splits it into 20 chunks (part of the solution is adapted from the questions here and here):

def unionAll(*dfs):
    """Union an arbitrary number of Spark dataframes sharing one schema.
    By @zero323, from here: http://stackoverflow.com/a/33744540/42346"""
    first, *rest = dfs  # Python 3.x; on 2.x you'll have to unpack manually
    # Union the underlying RDDs and rebuild a single dataframe
    # using the schema of the first one.
    return first.sql_ctx.createDataFrame(
        first.sql_ctx._sc.union([df.rdd for df in dfs]),
        first.schema
    )

import numpy as np

df_list = []
for chunk in np.array_split(df1, 20):  # df1 is the source pandas dataframe
    df_list.append(sqlContext.createDataFrame(chunk))

df_all = unionAll(*df_list)  # unpack the list: unionAll expects separate dataframe arguments
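Since the question says the source is HDF, a related sketch: if the HDF5 store happens to have been written in pandas' 'table' format (which supports chunked reads; the path and key below are hypothetical), you could read and convert the chunks directly without ever materializing the full pandas dataframe:

import pandas as pd

# Hypothetical path/key; chunked reads require the store to be in 'table' format.
chunks = pd.read_hdf("/path/to/data.h5", key="data", chunksize=50000)

df_list = [sqlContext.createDataFrame(chunk) for chunk in chunks]
df_all = unionAll(*df_list)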

Upvotes: 0
