Xyz

Reputation: 73

Convert Dask Dataframe to Spark dataframe using Python

I want to convert Dask Dataframe to Spark Dataframe.

Let's consider this example:

import dask.dataframe as dd
dask_df = dd.read_csv("file_name.csv")

# convert dask df to spark df
spark_df = spark_session.createDataFrame(dask_df)

But this is not working. Is there any alternative way to do this? Thanks in advance.

Upvotes: 2

Views: 2205

Answers (2)

Powers

Reputation: 19308

Writing the Dask DataFrame to disk with Dask and reading it back with Spark is the best approach for bigger datasets.

Here's how you can convert smaller datasets.

# Collect the Dask DataFrame into a single in-memory pandas DataFrame
pandas_df = dask_df.compute()
# Build a Spark DataFrame from the pandas DataFrame
pyspark_df = spark.createDataFrame(pandas_df)

I don't know of an in-memory way to convert a Dask DataFrame to a Spark DataFrame without a massive shuffle, but that'd certainly be a cool feature.


Upvotes: 3

mdurant

Reputation: 28673

Your best alternative is to save the dataframe to files, e.g., in Parquet format: dask_df.to_parquet(...). If your data is small enough, you could load it entirely into the client and feed the resulting pandas dataframe to Spark.

Although it's possible to co-locate Spark and Dask workers on the same nodes, they will not be in direct communication with each other, and streaming large data via the client is not a good idea.

Upvotes: 1
