Reputation: 141
Spark 3.1 introduced type hints for Python (hooray!), but I am puzzled as to why the return type of the toPandas method is "DataFrameLike" instead of pandas.DataFrame. See here: https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/conversion.pyi
Because of this, mypy throws all sorts of errors if I try to use any of the pandas DataFrame methods on the result of calling toPandas. For example,
df = spark_df.toPandas()
df.to_csv(out_path, index=False)
results in the error message
error: "DataFrameLike" has no attribute "to_csv"
What's going on here?
Upvotes: 9
Views: 2207
Reputation: 27123
To fix the mypy warnings, wrap the call in cast. cast has no effect at runtime, but it tells mypy to treat the result as a real pandas.DataFrame for the purposes of type-checking.
I like the other answers here, and maybe you can fix it without this cast trick/hack, but I'm giving this as another option:
import pandas as pd
from typing import cast
df = cast(pd.DataFrame, spark_df.toPandas())
df.to_csv(out_path, index=False)
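To illustrate the "no effect at runtime" point without needing Spark or pandas installed, here is a minimal sketch. FakeFrame is a hypothetical stand-in I made up for the object that toPandas() returns; it is not part of pyspark or pandas.

```python
from typing import cast


class FakeFrame:
    """Hypothetical stand-in for the object returned by toPandas()."""

    def to_csv(self) -> str:
        return "a,b\n1,2\n"


# Declared with a loose type, similar to how the stub declares "DataFrameLike".
original: object = FakeFrame()

# cast() only informs the type checker; at runtime it returns its argument unchanged.
narrowed = cast(FakeFrame, original)

# The cast did not copy or convert anything: it is the very same object.
same_object = narrowed is original

# mypy now accepts this call because it treats `narrowed` as a FakeFrame.
csv_text = narrowed.to_csv()
```

Note that cast does no checking at all, so if the value is not actually the type you claim, mypy stays quiet and the failure only shows up at runtime.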
Upvotes: 1
Reputation: 197
I believe this issue is fixed by this recent commit (dated Dec 22, 2021): https://github.com/apache/spark/commit/a70006d9a7b578721d152d0f89d1a894de38c25d
Right now, when you call .toPandas() and print out the type, it will actually give you a pandas DataFrame.
To read more about it, since your link is broken, here's the source code for DataFrameLike
So make sure you update your pyspark to the latest version.
Upvotes: 1