svn

Reputation: 179

pandas df to Spark df conversion takes a long time on a Databricks notebook

I have a pandas df with over 10 million rows. I'm trying to convert it to a Spark df using the method below.

from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName('pandasToSparkDF').getOrCreate()

# Pandas to Spark
spark_df = spark_session.createDataFrame(pandas_df)

This process takes ~9 minutes to convert the 10-million-row pandas df to a Spark df on Databricks, which is too long.

Is there a way to convert it faster?

Thanks. Appreciate the help.

Upvotes: 2

Views: 4112

Answers (1)

ziad.rida

Reputation: 374

What driver node size did you use?

One more thing: did you do the following?

import numpy as np
import pandas as pd

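# `spark` below is the SparkSession that Databricks notebooks create automatically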
# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Generate a pandas DataFrame
pdf = pd.DataFrame(np.random.rand(100, 3))

# Create a Spark DataFrame from a pandas DataFrame using Arrow
df = spark.createDataFrame(pdf)
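Note that on newer Databricks runtimes (Spark 3.0+), spark.sql.execution.arrow.enabled is deprecated in favor of a renamed flag. A minimal sketch, assuming a Spark 3.x runtime:

# Spark 3.0+ renamed the Arrow flag; the old name still works but
# logs a deprecation warning.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Optional: fall back to the non-Arrow conversion path instead of
# raising an error when a column type is not supported by Arrow.
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "true")

Arrow-based conversion also requires pyarrow, which comes pre-installed on Databricks runtimes.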

Check https://docs.databricks.com/spark/latest/spark-sql/spark-pandas.html

Upvotes: 2
