Hari Seldon

Reputation: 141

PySpark's "DataFrameLike" type vs pandas.DataFrame

Spark 3.1 introduced type hints for Python (hooray!), but I am puzzled as to why the return type of the toPandas method is "DataFrameLike" instead of pandas.DataFrame. See here: https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/conversion.pyi

Because of this, mypy reports an error whenever I call a pandas DataFrame method on the result of toPandas. For example,

df = spark_df.toPandas()
df.to_csv(out_path, index=False)

results in the error message

error: "DataFrameLike" has no attribute "to_csv" 

What's going on here?

Upvotes: 9

Views: 2207

Answers (2)

Aaron McDaid

Reputation: 27123

To fix the mypy warnings, wrap the call in typing.cast. cast has no effect at runtime; it only tells mypy to treat the result as a real pandas.DataFrame for the purposes of type-checking.

I like the other answers here, and you may be able to fix this without the cast trick/hack, but I'm giving it as another option:

import pandas as pd
from typing import cast

df = cast(pd.DataFrame, spark_df.toPandas())
df.to_csv(out_path, index=False)
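To see that cast really does nothing at runtime, here is a minimal sketch that needs neither pyspark nor pandas. The classes DataFrameLike and RealDataFrame and the function to_pandas are hypothetical stand-ins invented for illustration; the inheritance between them is a simplification, not a claim about how pyspark defines its types.

```python
from typing import cast


class DataFrameLike:
    """Hypothetical stand-in for the type mypy sees toPandas() return."""


class RealDataFrame(DataFrameLike):
    """Hypothetical stand-in for pandas.DataFrame."""

    def to_csv(self) -> str:
        return "a,b\n1,2\n"


def to_pandas() -> DataFrameLike:
    # Annotated with the wider type, but the object returned at runtime
    # is the concrete one (assumed here to mimic the question's situation).
    return RealDataFrame()


raw = to_pandas()
df = cast(RealDataFrame, raw)

# cast() hands back its argument unchanged: same object, no conversion,
# and no runtime check that the claimed type is correct.
same_object = df is raw
csv_text = df.to_csv()
```

Because cast performs no check, it will also happily accept a wrong claim, so it is only as trustworthy as your knowledge of what the call returns.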

Upvotes: 1

Duong Vu

Reputation: 197

I believe this issue is fixed by this recent commit (dated Dec 22, 2021): https://github.com/apache/spark/commit/a70006d9a7b578721d152d0f89d1a894de38c25d

With that change, if you call .toPandas() and print the type of the result, it actually gives you a pandas DataFrame.

To read more about it, since your link is broken, here's the source code for DataFrameLike

So make sure you update pyspark to the latest version.
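Before upgrading, it can help to confirm which pyspark version is installed in the environment mypy runs in. This is a small sketch using only the standard library (Python 3.8+); the helper name installed_version is made up for illustration, and since no release number for the fix is given here, the sketch only reports what is installed.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


pyspark_version = installed_version("pyspark")
print("pyspark:", pyspark_version or "not installed")
```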

Upvotes: 1
