santoku

Reputation: 3427

converting between spark df, parquet object and pandas df

I converted a parquet file to pandas without issue, but I ran into errors converting the parquet file to a Spark DataFrame and converting the Spark DataFrame to pandas.

After creating a Spark session, I ran this code:

spark_df=spark.read.parquet('summarydata.parquet')

spark_df.select('*').toPandas()

It returns an error.

Alternatively, given a parquet object (pd.read_table('summary data.parquet')), how can I convert it to a Spark DataFrame?

The reason I need both a Spark DataFrame and a pandas DataFrame is that for smaller DataFrames I want to use the various pandas EDA functions, while for bigger ones I need Spark SQL. Converting parquet to pandas first and then to a Spark DataFrame seems like a detour.

Upvotes: 0

Views: 3199

Answers (1)

Bipin

Reputation: 104

To convert a pandas DataFrame into a Spark DataFrame and vice versa, you will need PyArrow, an in-memory columnar data format that Spark uses to transfer data efficiently between the JVM and Python processes.

Arrow is available as an optimization when converting a Spark DataFrame to a pandas DataFrame with toPandas() and when creating a Spark DataFrame from a pandas DataFrame with createDataFrame(pandas_df). To use Arrow for these calls, first set the Spark configuration spark.sql.execution.arrow.enabled to true; it is disabled by default.

In addition, the optimizations enabled by spark.sql.execution.arrow.enabled can automatically fall back to the non-Arrow implementation if an error occurs before the actual computation within Spark. This behavior is controlled by spark.sql.execution.arrow.fallback.enabled.
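As a sketch, the two Arrow-related settings mentioned above can be set together on an existing session (a config fragment, assuming a SparkSession bound to the name spark and the Spark 2.x property names; in Spark 3.x these were renamed under spark.sql.execution.arrow.pyspark.*):

```python
# Enable Arrow-based transfers between JVM and Python
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Allow automatic fallback to the non-Arrow path if conversion fails
spark.conf.set("spark.sql.execution.arrow.fallback.enabled", "true")
```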

For more details, refer to the PySpark Usage Guide for Pandas with Apache Arrow.

import numpy as np
import pandas as pd

# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Generate a Pandas DataFrame
pdf = pd.DataFrame(np.random.rand(100, 3))

# Create a Spark DataFrame from a Pandas DataFrame using Arrow
df = spark.createDataFrame(pdf)

# Convert the Spark DataFrame back to a Pandas DataFrame using Arrow
result_pdf = df.select("*").toPandas()

Upvotes: 1
