Reputation: 179
I have a pandas df with over 10 million rows. I'm trying to convert this pandas df to a Spark df using the method below.
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName('pandasToSparkDF').getOrCreate()

# Pandas to Spark
spark_df = spark_session.createDataFrame(pandas_df)
This process takes ~9 minutes to convert a 10 million row pandas df to a Spark df on Databricks, which is too long.
Is there any other way to convert it faster?
Thanks. Appreciate the help.
Upvotes: 2
Views: 4112
Reputation: 374
What driver node size did you use?
One more thing: did you do this?
import numpy as np
import pandas as pd
# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
# Generate a pandas DataFrame
pdf = pd.DataFrame(np.random.rand(100, 3))
# Create a Spark DataFrame from a pandas DataFrame using Arrow
df = spark.createDataFrame(pdf)
Check https://docs.databricks.com/spark/latest/spark-sql/spark-pandas.html
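Applied to the snippet from your question, it would look roughly like the sketch below. Note this is just an illustration: the column names and row count stand in for your real data, and the spark.sql.execution.arrow.pyspark.enabled key assumes Spark 3.x (on Spark 2.x, use the spark.sql.execution.arrow.enabled key shown above, which still works on 3.x but is deprecated).

from pyspark.sql import SparkSession
import numpy as np
import pandas as pd

spark = SparkSession.builder.appName('pandasToSparkDF').getOrCreate()

# Enable Arrow-based columnar transfers (Spark 3.x key name)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Stand-in for your real 10M-row pandas DataFrame
pandas_df = pd.DataFrame(np.random.rand(10_000_000, 3), columns=["a", "b", "c"])

# With Arrow enabled, the data is serialized in columnar batches
# instead of being pickled row by row, which is where the speed-up comes from
spark_df = spark.createDataFrame(pandas_df)
spark_df.printSchema()

If the conversion is still slow after enabling Arrow, it usually means Arrow fell back to the non-Arrow path (check the driver logs for a fallback warning), often because a column has a type Arrow can't handle.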
Upvotes: 2