Reputation: 83
I have a PySpark DataFrame with the following schema:
root
|-- src_ip: integer (nullable = true)
|-- dst_ip: integer (nullable = true)
When converting this DataFrame to pandas via toPandas(), the column type changes from integer in Spark to float in pandas:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9847 entries, 0 to 9846
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 src_ip 9607 non-null float64
1 dst_ip 9789 non-null float64
dtypes: float64(2)
memory usage: 154.0 KB
Is there any way to keep the integer type with toPandas(), or can I only cast the column type in the resulting pandas DataFrame?
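For reference, the behavior can be reproduced with a minimal sketch like this (assuming an active SparkSession named spark; the values are made up):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# An integer column that contains nulls
df = spark.createDataFrame([(1, None), (None, 2)], ["src_ip", "dst_ip"])
print(df.dtypes)   # [('src_ip', 'bigint'), ('dst_ip', 'bigint')]

pdf = df.toPandas()
print(pdf.dtypes)  # both columns come back as float64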
Upvotes: 8
Views: 4942
Reputation: 605
You should use pandas' nullable integer dtype:
df = spark.createDataFrame([(0, 1), (0, None)], ["a", "b"])
print(df.dtypes)
pdf = df.toPandas()
# Cast the integer column to the nullable 'Int64' dtype
pdf['b'] = pdf['b'].astype('Int64')
print(pdf.dtypes)
print(pdf)
The capital 'I' in 'Int64' distinguishes it from NumPy's 'int64' dtype.
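To illustrate the difference outside of Spark (a quick sketch, not tied to the question's data):
import pandas as pd

# The nullable extension dtype keeps integers and represents missing values as <NA>
s = pd.Series([1, None], dtype='Int64')
print(s.dtype)  # Int64

# NumPy's int64 has no representation for missing values, so this raises an error:
# pd.Series([1, None], dtype='int64')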
Upvotes: 3
Reputation: 66
SPARK-21766 (https://issues.apache.org/jira/browse/SPARK-21766) explains the behavior you observed: integer columns that contain nulls are promoted to float64, because NumPy integers have no representation for missing values.
As a workaround, you can call fillna(0) before toPandas():
df1 = spark.createDataFrame([(0, None), (None, 8)], ["src_ip", "dest_ip"])
print(df1.dtypes)
# Reproduce the issue
pdf1 = df1.toPandas()
print(pdf1.dtypes)
# A workaround
pdf2 = df1.fillna(0).toPandas()
print(pdf2.dtypes)
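Note that fillna(0) makes real zeros indistinguishable from values that were null. If that matters, you can instead cast to the nullable Int64 dtype from the other answer after converting (a sketch):
pdf3 = df1.toPandas().astype({'src_ip': 'Int64', 'dest_ip': 'Int64'})
print(pdf3.dtypes)  # both columns become Int64, with nulls kept as <NA>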
Upvotes: 3