Reputation: 373
I am reading a CSV file, converting it to a Spark dataframe and then doing some aggregations. The original CSV has missing data, which is represented as NaN when read via Pandas. After converting to PySpark, the NaN values remain instead of being replaced by null. This leads to my .avg() aggregation returning NaN values.
import pandas as pd
from pyspark.sql import functions as sf
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType

df = pd.read_csv('data.csv', index_col=False)

spark_schema = StructType([
    StructField("id", StringType(), True),
    StructField("segment", StringType(), True),
    StructField("score", DoubleType(), True),
    StructField("sales", IntegerType(), True)
])

spark_df = spark.createDataFrame(data=df, schema=spark_schema).cache()
segment_avg = (
    spark_df
    .groupBy('segment')
    .agg(
        sf.avg('score').alias('score')
    )
)
segment_avg.show(truncate=False)
+----------+------+
|segment |score |
+----------+------+
|Gold |NaN |
|Platinum |NaN |
|Silver |NaN |
+----------+------+
I have tried to replace all NaN values in spark_df like so:
spark_df = spark_df.replace(float('nan'), None)
But it errors:
ValueError: If to_replace is not a dict, value should be a float, int, long, string, list, or tuple. Got <class 'NoneType'>
For reference, here is a sample of the score column:
spark_df.select('score').show(10, False)
+-----+
|score|
+-----+
|75.0 |
|34.0 |
|NaN |
|NaN |
|NaN |
|76.0 |
|65.0 |
|43.0 |
|78.0 |
|88.0 |
+-----+
How can I replace these NaN values with null so that they are ignored when aggregating?
Upvotes: 1
Views: 4405
Reputation: 2939
Use the isnan function to replace NaN with null in a single column:
from pyspark.sql.functions import isnan, col, when, lit
spark_df = spark_df.withColumn(
    'score',
    when(isnan(col('score')), lit(None)).otherwise(col('score'))
)
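If more than one column can contain NaN, the same pattern can be applied in a loop over the schema. A minimal sketch, assuming NaN can only appear in FloatType/DoubleType columns (isnan is not defined for other types):

from pyspark.sql.functions import isnan, when, lit, col
from pyspark.sql.types import DoubleType, FloatType

# Replace NaN with null in every floating-point column of spark_df.
for field in spark_df.schema.fields:
    if isinstance(field.dataType, (DoubleType, FloatType)):
        spark_df = spark_df.withColumn(
            field.name,
            when(isnan(col(field.name)), lit(None)).otherwise(col(field.name))
        )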
To replace all NaN values across the entire dataframe in one call, use .replace:
spark_df = spark_df.replace(float('nan'), None)
Passing None as the replacement value is supported in newer Spark releases; on older versions that raise the ValueError shown in the question, apply the when/isnan approach above to each affected column instead.
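As a sanity check that avg ignores null but propagates NaN, here is a small self-contained demo with made-up values (the replace call needs a Spark version that accepts None as the replacement):

from pyspark.sql import SparkSession
from pyspark.sql import functions as sf

spark = SparkSession.builder.getOrCreate()

# One real value and one NaN: avg propagates the NaN.
demo = spark.createDataFrame([(1.0,), (float('nan'),)], ['x'])
demo.agg(sf.avg('x')).show()  # avg(x) is NaN

# After swapping NaN for null, avg skips the null row.
demo.replace(float('nan'), None).agg(sf.avg('x')).show()  # avg(x) is 1.0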
Upvotes: 1