Reputation: 373
I am reading a CSV file, converting it to a Spark dataframe and then doing some aggregations. The original CSV has missing data, which is represented as NaN when read via Pandas. After converting to PySpark, the NaN values remain instead of being replaced by null. This leads to my .avg() aggregation returning NaN values.
import pandas as pd
from pyspark.sql import functions as sf
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType

df = pd.read_csv('data.csv', index_col=False)

spark_schema = StructType([
    StructField("id", StringType(), True),
    StructField("segment", StringType(), True),
    StructField("score", DoubleType(), True),
    StructField("sales", IntegerType(), True)
])

spark_df = spark.createDataFrame(data=df, schema=spark_schema).cache()
segment_avg = (
    spark_df
    .groupBy('segment')
    .agg(
        sf.avg('score').alias('score')
    )
)
segment_avg.show(truncate=False)
+----------+------+
|segment |score |
+----------+------+
|Gold |NaN |
|Platinum |NaN |
|Silver |NaN |
+----------+------+
I have tried to replace all NaN values in spark_df like so:
spark_df = spark_df.replace(float('nan'), None)
But it errors:
ValueError: If to_replace is not a dict, value should be a float, int, long, string, list, or tuple. Got <class 'NoneType'>
For reference, here is a sample of the score column:
spark_df.select('score').show(10, False)
+-----+
|score|
+-----+
|75.0 |
|34.0 |
|NaN |
|NaN |
|NaN |
|76.0 |
|65.0 |
|43.0 |
|78.0 |
|88.0 |
+-----+
How can I replace these NaN values with null so that they are ignored when aggregating?
Upvotes: 1
Views: 4405
Reputation: 2939
Use the isnan function to replace NaN with null in a single column:
from pyspark.sql.functions import isnan, col, when, lit
spark_df = spark_df.withColumn(
    'score',
    when(isnan(col('score')), lit(None)).otherwise(col('score'))
)
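If more than one column can contain NaN, the same pattern can be applied in a loop over the schema. A minimal sketch, assuming NaN can only appear in FloatType/DoubleType columns (isnan is not defined for other types):

from pyspark.sql.functions import isnan, when, lit, col
from pyspark.sql.types import DoubleType, FloatType

# Replace NaN with null in every floating-point column of spark_df.
for field in spark_df.schema.fields:
    if isinstance(field.dataType, (DoubleType, FloatType)):
        spark_df = spark_df.withColumn(
            field.name,
            when(isnan(col(field.name)), lit(None)).otherwise(col(field.name))
        )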
To replace all NaN values across the entire dataframe in one call, use .replace:
spark_df = spark_df.replace(float('nan'), None)
Passing None as the replacement value is supported in newer Spark releases; on older versions that raise the ValueError shown in the question, apply the when/isnan approach above to each affected column instead.
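As a sanity check that avg ignores null but propagates NaN, here is a small self-contained demo with made-up values (the replace call needs a Spark version that accepts None as the replacement):

from pyspark.sql import SparkSession
from pyspark.sql import functions as sf

spark = SparkSession.builder.getOrCreate()

# One real value and one NaN: avg propagates the NaN.
demo = spark.createDataFrame([(1.0,), (float('nan'),)], ['x'])
demo.agg(sf.avg('x')).show()  # avg(x) is NaN

# After swapping NaN for null, avg skips the null row.
demo.replace(float('nan'), None).agg(sf.avg('x')).show()  # avg(x) is 1.0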
Upvotes: 1