manish dev

Reputation: 97

Can a Spark DataFrame (Scala) be converted to a pandas DataFrame (Python)?

The DataFrame is created using the Scala API for Spark:

val someDF = spark.createDataFrame( spark.sparkContext.parallelize(someData), StructType(someSchema) )

I want to convert this to a pandas DataFrame.

PySpark provides .toPandas() to convert a Spark DataFrame to pandas, but there is no equivalent for Scala (that I can find).

Please help me in this regard.

Upvotes: 0

Views: 2318

Answers (2)

gsthina

Reputation: 1100

To convert a Spark DataFrame into a pandas DataFrame, set spark.sql.execution.arrow.enabled to true, read or create the DataFrame using Spark as usual, and then convert it to a pandas DataFrame using Arrow:

  1. Enable Arrow: spark.conf.set("spark.sql.execution.arrow.enabled", "true")
  2. Create the DataFrame using Spark like you did:
    val someDF = spark.createDataFrame( spark.sparkContext.parallelize(someData), StructType(someSchema) )
  3. Convert it to a pandas DataFrame (this step is on the PySpark side):
    result_pdf = someDF.select("*").toPandas()

The above commands run using Arrow because the config spark.sql.execution.arrow.enabled is set to true.

Hope this helps!

Upvotes: 2

Boris Azanov

Reputation: 4481

In Spark, a DataFrame is just an abstraction over data; the most common sources of that data are files on a file system. When PySpark converts a DataFrame to pandas format, it is simply translating its own abstraction over the data into the abstraction of another Python framework. You can't do that conversion from Scala, because pandas is a Python library for working with data while Spark is not, and you would run into difficulties integrating Python and Scala directly. The simplest thing you can do here is:

  1. Write the DataFrame to the file system from Scala Spark.
  2. Read the data from the file system using pandas.
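The two steps above can be sketched as follows. This is a minimal sketch, assuming a running SparkSession `spark` and the `someDF` from the question; the path `/tmp/someDF.parquet` is just an illustrative choice. Parquet is a good intermediate format here because it preserves the schema and pandas can read it back (via pyarrow or fastparquet).

```scala
// Step 1 (Scala side): write the DataFrame out in a format pandas can read.
someDF.write
  .mode("overwrite")              // replace the output if it already exists
  .parquet("/tmp/someDF.parquet") // illustrative path, not from the question

// Step 2 (Python side), shown as a comment since this answer's code is Scala:
//   import pandas as pd
//   pdf = pd.read_parquet("/tmp/someDF.parquet")
```

Note that Spark writes the output as a directory of part files; pandas' read_parquet accepts such a directory path when pyarrow is the engine.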

Upvotes: 1
