cs_guy

Reputation: 373

Replace NaN with null when using CreateDataFrame with pandas data

I am reading a CSV, converting it to a Spark dataframe, and then doing some aggregations. The original CSV has missing data, which is represented as NaN when read via pandas. After converting to PySpark, the NaN values remain instead of being replaced by null, so my .avg() aggregation returns NaN.

import pandas as pd

df = pd.read_csv('data.csv', index_col=False)

from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType

spark_schema = StructType([
    StructField("id", StringType(), True),
    StructField("segment", StringType(), True),
    StructField("score", DoubleType(), True),
    StructField("sales", IntegerType(), True)
])

spark_df = spark.createDataFrame(data=df, schema=spark_schema).cache()

import pyspark.sql.functions as sf

segment_avg = (
    spark_df
    .groupBy('segment')
    .agg(
        sf.avg('score').alias('score')
    )
)

segment_avg.show(truncate=False)

+----------+------+
|segment   |score |
+----------+------+
|Gold      |NaN   |
|Platinum  |NaN   |
|Silver    |NaN   |
+----------+------+

I have tried to replace all the NaN values in spark_df like so:

spark_df = spark_df.replace(float('nan'), None)

But it raises an error:

ValueError: If to_replace is not a dict, value should be a float, int, long, string, list, or tuple. Got <class 'NoneType'>

spark_df.select('score').show(10, False)

+-----+
|score|
+-----+
|75.0 |
|34.0 |
|NaN  |
|NaN  |
|NaN  |
|76.0 |
|65.0 |
|43.0 |
|78.0 |
|88.0 |
+-----+

How can I replace these NaN values with null so they are ignored when aggregating?

Upvotes: 1

Views: 4405

Answers (1)

AdibP

Reputation: 2939

Use the isnan function together with when/otherwise to replace NaN with null in a column:

from pyspark.sql.functions import isnan, col, when, lit

# Rows where score is NaN become null; all other rows keep their value
spark_df = spark_df.withColumn('score', when(isnan(col('score')), lit(None)).otherwise(col('score')))
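
If more than one column can contain NaN, the same when/isnan pattern can be looped over the schema. A minimal sketch, assuming the spark_schema above (only floating-point columns such as DoubleType can hold NaN):

from pyspark.sql.functions import isnan, col, when, lit
from pyspark.sql.types import DoubleType

# Apply the NaN -> null conversion to every DoubleType column in the schema
for field in spark_df.schema.fields:
    if isinstance(field.dataType, DoubleType):
        spark_df = spark_df.withColumn(
            field.name,
            when(isnan(col(field.name)), lit(None)).otherwise(col(field.name))
        )

Since avg() (like most Spark aggregates) skips nulls but propagates NaN, the conversion makes the segment averages come out as proper means.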

To replace all NaN values across the entire dataframe in one call, use .replace (note that this form needs a Spark version that accepts None as the replacement value; the ValueError in the question comes from releases that do not):

spark_df = spark_df.replace(float('nan'), None)
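
If your Spark version rejects None here with the ValueError shown in the question, a dict mapping is worth trying, since the value argument is only validated when to_replace is not a dict. A sketch of the same replacement in dict form:

# Map NaN -> null for all matching columns; the dict form bypasses the value-type check
spark_df = spark_df.replace({float('nan'): None})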

Upvotes: 1
