Reputation: 115
How to correctly transform a Polars DataFrame to a pySpark DataFrame?
More specifically, the conversion methods I've tried all seem to have problems parsing columns containing arrays/lists.
Create a Spark DataFrame:
data = [{"id": 1, "strings": ['A', 'C'], "floats": [0.12, 0.43]},
{"id": 2, "strings": ['B', 'B'], "floats": [0.01]},
{"id": 3, "strings": ['C'], "floats": [0.09, 0.01]}
]
sparkdf = spark.createDataFrame(data)
Convert it to Polars:
import pyarrow as pa
import polars as pl
pldf = pl.from_arrow(pa.Table.from_batches(sparkdf._collect_as_arrow()))
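For reference, the list columns survive this direction of the conversion; a quick check (output is illustrative, and the exact dtype rendering varies by Polars version):
print(pldf.schema)
# e.g. {'id': Int64, 'strings': List(Utf8), 'floats': List(Float64)}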
Try to convert back to a Spark DataFrame (attempt 1):
spark.createDataFrame(pldf.to_pandas())
TypeError: Can not infer schema for type: <class 'numpy.ndarray'>
TypeError: Unable to infer the type of the field floats.
Try to convert back to a Spark DataFrame (attempt 2):
schema = sparkdf.schema
spark.createDataFrame(pldf.to_pandas(), schema)
TypeError: field floats: ArrayType(DoubleType(), True) can not accept object array([0.12, 0.43]) in type <class 'numpy.ndarray'>
Relevant: How to transform Spark dataframe to Polars dataframe?
Upvotes: 4
Views: 4634
Reputation: 1914
I discovered the right way to do this while reading 'In-Memory Analytics with Apache Arrow' by Matthew Topol.
You can do:
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
df_spark = spark.createDataFrame(df_polars.to_pandas())
It's quite fast.
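Applied to the question's example, a minimal self-contained sketch (this assumes an active SparkSession named spark and the pldf built in the question):
# enable Arrow-based conversion between pandas and Spark, then convert via pandas
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
df_spark = spark.createDataFrame(pldf.to_pandas())
df_spark.printSchema()  # the list columns should come back as array<string> / array<double>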
I also tried converting to an Arrow-backed Pandas DataFrame first (i.e. df_polars.to_pandas(use_pyarrow_extension_array=True)), but it does not work: Spark complains that it does not know how to handle column types such as large strings (Polars' Utf8) or unsigned integers.
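The mismatch is visible in the Arrow schema that Polars exports; a hedged illustration (exact type names may vary by Polars/Arrow version):
print(df_polars.to_arrow().schema)
# strings: large_list<item: large_string>  <- Spark's Arrow path rejects the "large" variants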
Not setting spark.sql.execution.arrow.pyspark.enabled to true increased the time roughly 90-fold in my test (from 1.5 seconds to 2 minutes 18 seconds).
Upvotes: 4
Reputation: 18556
What about
spark.createDataFrame(pldf.to_dicts())
Alternatively you could do:
spark.createDataFrame({x:y.to_list() for x,y in pldf.to_dict().items()})
Since the to_dict method returns Polars Series instead of lists, I'm using a comprehension to convert the Series into regular lists, which Spark understands.
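For the question's data, an illustration of what to_dicts() yields (plain Python lists, which Spark's schema inference handles):
pldf.to_dicts()
# [{'id': 1, 'strings': ['A', 'C'], 'floats': [0.12, 0.43]},
#  {'id': 2, 'strings': ['B', 'B'], 'floats': [0.01]},
#  {'id': 3, 'strings': ['C'], 'floats': [0.09, 0.01]}]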
Upvotes: 2
Reputation: 1
DataFrame.transform(func: Callable[[…], DataFrame], *args: Any, **kwargs: Any) → pyspark.sql.dataframe.DataFrame
Returns a new DataFrame. Concise syntax for chaining custom transformations.
Upvotes: 0