Егор Фадеев

Reputation: 83

PySpark toPandas function is changing column type

I have a pyspark dataframe with following schema:

root
 |-- src_ip: integer (nullable = true)
 |-- dst_ip: integer (nullable = true)

When converting this dataframe to pandas via toPandas(), the column type changes from integer in Spark to float64 in pandas:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9847 entries, 0 to 9846
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   src_ip  9607 non-null   float64
 1   dst_ip  9789 non-null   float64
dtypes: float64(2)
memory usage: 154.0 KB
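
The non-null counts above show that both columns contain nulls. A minimal sketch that reproduces the cast (hypothetical data, assuming an active SparkSession named spark):

df = spark.createDataFrame([(1, 2), (None, 3)], ["src_ip", "dst_ip"])
print(df.toPandas().dtypes)  # both columns come back as float64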

Is there any way to keep the integer type through toPandas(), or can I only cast the column type in the resulting pandas dataframe?

Upvotes: 8

Views: 4942

Answers (2)

Michael H.

Reputation: 605

You should use pandas' nullable integer dtype, 'Int64':

df = spark.createDataFrame([(0, 1), (0, None)], ["a", "b"])
print(df.dtypes)  # [('a', 'bigint'), ('b', 'bigint')]

# Column 'b' contains a null, so toPandas() falls back to float64;
# cast it to the nullable 'Int64' dtype to restore integers
pdf = df.toPandas()
pdf['b'] = pdf['b'].astype('Int64')
print(pdf.dtypes)
print(pdf)

The capital 'I' in 'Int64' differentiates it from NumPy's 'int64' dtype, which cannot represent missing values.
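
A quick way to see the difference in plain pandas (a minimal sketch):

import pandas as pd

# Plain NumPy int64 cannot hold missing values, so pandas upcasts to float64
print(pd.Series([1, None]).dtype)                 # float64
# The nullable extension dtype keeps integers; missing values become <NA>
print(pd.Series([1, None], dtype='Int64').dtype)  # Int64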

Upvotes: 3

acan

Reputation: 66

SPARK-21766 (https://issues.apache.org/jira/browse/SPARK-21766) explains the behavior you observed: integer columns that contain nulls are converted to float64 because NumPy integers cannot represent NaN.

As a workaround, you can call fillna(0) before toPandas():

df1 = spark.createDataFrame([(0, None), (None, 8)], ["src_ip", "dst_ip"])
print(df1.dtypes)  # [('src_ip', 'bigint'), ('dst_ip', 'bigint')]

# Reproduce the issue: columns with nulls come back as float64
pdf1 = df1.toPandas()
print(pdf1.dtypes)

# Workaround: fill the nulls first so the columns stay integral
pdf2 = df1.fillna(0).toPandas()
print(pdf2.dtypes)
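
Note that if 0 is a meaningful value in your data (for IP columns, 0 would be 0.0.0.0), filling nulls loses information. An alternative sketch that keeps the nulls and casts to the nullable dtype from the other answer (assumes a reasonably recent pandas):

# Keep the nulls: cast the float64 columns to nullable 'Int64' instead
pdf3 = df1.toPandas()
pdf3 = pdf3.astype({'src_ip': 'Int64', 'dst_ip': 'Int64'})
print(pdf3.dtypes)  # both Int64, nulls preserved as <NA>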

Upvotes: 3
