ihopethiswillfi

Reputation: 115

How to transform a Polars DataFrame to a pySpark DataFrame?

How to correctly transform a Polars DataFrame to a pySpark DataFrame?

More specifically, the conversion methods I've tried all seem to have problems parsing columns that contain arrays / lists.

Create a Spark DataFrame:

data = [{"id": 1, "strings": ['A', 'C'], "floats": [0.12, 0.43]},
        {"id": 2, "strings": ['B', 'B'], "floats": [0.01]},
        {"id": 3, "strings": ['C'], "floats": [0.09, 0.01]}
        ]

sparkdf = spark.createDataFrame(data)

Convert it to Polars:

import pyarrow as pa
import polars as pl

# note: _collect_as_arrow() is a private PySpark API
pldf = pl.from_arrow(pa.Table.from_batches(sparkdf._collect_as_arrow()))

Try to convert back to a Spark DataFrame (attempt 1):

spark.createDataFrame(pldf.to_pandas())


TypeError: Can not infer schema for type: <class 'numpy.ndarray'>
TypeError: Unable to infer the type of the field floats.

Try to convert back to a Spark DataFrame (attempt 2):

schema = sparkdf.schema
spark.createDataFrame(pldf.to_pandas(), schema)

TypeError: field floats: ArrayType(DoubleType(), True) can not accept object array([0.12, 0.43]) in type <class 'numpy.ndarray'>
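The errors show what's going on: to_pandas() leaves the nested columns as numpy arrays, which Spark's row converter can't map onto ArrayType. One possible workaround is to turn each ndarray cell back into a plain list before handing the frame to Spark; a minimal sketch, assuming the pldf and schema from above:

pdf = pldf.to_pandas()

# convert each ndarray cell back into a plain Python list
for col in ("strings", "floats"):
    pdf[col] = pdf[col].map(list)

spark.createDataFrame(pdf, schema)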

Relevant: How to transform Spark dataframe to Polars dataframe?

Upvotes: 4

Views: 4634

Answers (3)

Luca

Reputation: 1914

I discovered the right way to do this while reading 'In-Memory Analytics with Apache Arrow' by Matthew Topol.

You can do:

# enable Arrow-based data transfer between pandas and Spark
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
df_spark = spark.createDataFrame(df_polars.to_pandas())

It's quite fast.

I also tried converting to an Arrow-backed pandas DataFrame first (i.e. df_polars.to_pandas(use_pyarrow_extension_array=True)), but it does not work: Spark complains that it does not know how to handle column types such as large strings (Polars' Utf8) or unsigned integers.

Not setting spark.sql.execution.arrow.pyspark.enabled to true increased the time 90-fold in my test (from 1.5 seconds to 2 minutes 18 seconds).
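If you want to reproduce the comparison, a rough sketch (df_polars is assumed to be a reasonably large frame; absolute timings will of course vary):

import time

for enabled in ("true", "false"):
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", enabled)
    start = time.time()
    spark.createDataFrame(df_polars.to_pandas()).count()  # force the conversion job
    print(f"arrow enabled={enabled}: {time.time() - start:.1f}s")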

Upvotes: 4

Dean MacGregor

Reputation: 18556

What about:

spark.createDataFrame(pldf.to_dicts())
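Here to_dicts() turns the frame into one plain Python dict per row, a shape Spark's schema inference handles natively; with the pldf from the question it produces:

pldf.to_dicts()
# [{'id': 1, 'strings': ['A', 'C'], 'floats': [0.12, 0.43]},
#  {'id': 2, 'strings': ['B', 'B'], 'floats': [0.01]},
#  {'id': 3, 'strings': ['C'], 'floats': [0.09, 0.01]}]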

Alternatively you could do:

spark.createDataFrame({x:y.to_list() for x,y in pldf.to_dict().items()})

Since the to_dict method returns Polars Series instead of lists, I'm using a comprehension to convert each Series into a regular list, which Spark understands.

Upvotes: 2

Anurag negi

Reputation: 1

From the PySpark API reference:

DataFrame.transform(func: Callable[..., DataFrame], *args: Any, **kwargs: Any) -> pyspark.sql.dataframe.DataFrame

Returns a new DataFrame. Concise syntax for chaining custom transformations.
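For completeness, a minimal sketch of how transform chains a custom function over the sparkdf from the question (add_float_count is a hypothetical helper):

from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def add_float_count(df: DataFrame) -> DataFrame:
    # append the length of each 'floats' array as a new column
    return df.withColumn("float_count", F.size("floats"))

sparkdf.transform(add_float_count).show()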

Upvotes: 0
