Reputation: 73
I want to convert a Dask DataFrame to a Spark DataFrame.
Let's consider this example:
import dask.dataframe as dd
dask_df = dd.read_csv("file_name.csv")
# convert dask df to spark df
spark_df = spark_session.createDataFrame(dask_df)
But this is not working. Is there an alternative way to do this? Thanks in advance.
Upvotes: 2
Views: 2205
Reputation: 19308
Writing the Dask DataFrame to disk and reading it back with Spark is the best approach for bigger datasets.
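Here's a minimal sketch of that disk-based round trip, assuming an existing SparkSession named spark and a hypothetical output path tmp/parquet_out:

# write the Dask DataFrame out as Parquet files (one per partition)
dask_df.to_parquet("tmp/parquet_out")
# read the same Parquet files back in as a Spark DataFrame
spark_df = spark.read.parquet("tmp/parquet_out")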
Here's how you can convert smaller datasets.
# collect the Dask DataFrame into a single in-memory pandas DataFrame
pandas_df = dask_df.compute()
# build a Spark DataFrame from the pandas DataFrame
pyspark_df = spark.createDataFrame(pandas_df)
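Keep in mind that compute() pulls the entire dataset into the driver's memory, so this only works when the data fits on a single machine.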
I don't know of an in-memory way to convert a Dask DataFrame to a Spark DataFrame without a massive shuffle, but that'd certainly be a cool feature.
Upvotes: 3
Reputation: 28673
Your best alternative is to save the dataframe to files, e.g., in the Parquet format: dask_df.to_parquet(...). If your data is small enough, you can load it entirely into the client and feed the resulting pandas dataframe to Spark.
Although it's possible to co-locate Spark and Dask workers on the same nodes, they will not be in direct communication with each other, and streaming large data via the client is not a good idea.
Upvotes: 1