datatatata
datatatata

Reputation: 43

How to replace NaN with 0 in PySpark data frame column?

Is there any way to replace NaN with 0 in PySpark using df.withColumn function like using fillna in Python?

Upvotes: 3

Views: 7303

Answers (2)

Vaebhav
Vaebhav

Reputation: 5052

fillna is natively available within Pyspark -

Apart from that you can do this with a combination of isNull and when -

Data Preparation

input_str = """
|1  |704    |STANDARD|null               |PR   |30100
|2  |null    |null    |PASEO COSTA DEL SUR|PR   |null
|3  |709    |null    |BDA SAN LUIS       |PR   |3700
|4  |null  |UNIQUE  |CINGULAR WIRELESS  |TX   |84000
|5  |76177  |STANDARD|null               |TX   |null
""".split("|")

input_values = list(map(lambda x: x.strip() if x.strip() != 'null' else None, input_str[1:]))

n = len(input_values)

cols = ['id','zipcode','type','city','state','population']

input_list = [tuple(input_values[i:i+6]) for i in range(0,n,6)]

sparkDF = sql.createDataFrame(input_list, cols)

sparkDF = sparkDF.withColumn('zipcode',F.col('zipcode').cast(DoubleType()))\
                 .withColumn('population',F.col('population').cast(DoubleType()))

sparkDF.show()

+---+-------+--------+-------------------+-----+----------+
| id|zipcode|    type|               city|state|population|
+---+-------+--------+-------------------+-----+----------+
|  1|    704|STANDARD|               null|   PR|     30100|
|  2|   null|    null|PASEO COSTA DEL SUR|   PR|      null|
|  3|    709|    null|       BDA SAN LUIS|   PR|      3700|
|  4|   null|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|
|  5|  76177|STANDARD|               null|   TX|      null|
+---+-------+--------+-------------------+-----+----------+

Fillna


sparkDF = sparkDF.na.fill(0)

sparkDF.show()

+---+-------+--------+-------------------+-----+----------+
| id|zipcode|    type|               city|state|population|
+---+-------+--------+-------------------+-----+----------+
|  1|  704.0|STANDARD|               null|   PR|   30100.0|
|  2|    0.0|    null|PASEO COSTA DEL SUR|   PR|       0.0|
|  3|  709.0|    null|       BDA SAN LUIS|   PR|    3700.0|
|  4|    0.0|  UNIQUE|  CINGULAR WIRELESS|   TX|   84000.0|
|  5|76177.0|STANDARD|               null|   TX|       0.0|
+---+-------+--------+-------------------+-----+----------+

Fillna - Subset

sparkDF = sparkDF.na.fill(0,subset=["population"])

sparkDF.show()

+---+-------+--------+-------------------+-----+----------+
| id|zipcode|    type|               city|state|population|
+---+-------+--------+-------------------+-----+----------+
|  1|  704.0|STANDARD|               null|   PR|   30100.0|
|  2|   null|    null|PASEO COSTA DEL SUR|   PR|       0.0|
|  3|  709.0|    null|       BDA SAN LUIS|   PR|    3700.0|
|  4|   null|  UNIQUE|  CINGULAR WIRELESS|   TX|   84000.0|
|  5|76177.0|STANDARD|               null|   TX|       0.0|
+---+-------+--------+-------------------+-----+----------+

When & isNull

sparkDF = sparkDF.withColumn('type',F.when(F.col('type').isNull(),'Not Available').otherwise(F.col('type')))

sparkDF.show()

+---+-------+-------------+-------------------+-----+----------+
| id|zipcode|         type|               city|state|population|
+---+-------+-------------+-------------------+-----+----------+
|  1|  704.0|     STANDARD|               null|   PR|   30100.0|
|  2|   null|Not Available|PASEO COSTA DEL SUR|   PR|      null|
|  3|  709.0|Not Available|       BDA SAN LUIS|   PR|    3700.0|
|  4|   null|       UNIQUE|  CINGULAR WIRELESS|   TX|   84000.0|
|  5|76177.0|     STANDARD|               null|   TX|      null|
+---+-------+-------------+-------------------+-----+----------+

Upvotes: 2

notNull
notNull

Reputation: 31520

Use when + otherwise statement with isnan function.

Example:

df.show()
#+---+---+
#| id|num|
#+---+---+
#|NaN|  2|
#+---+---+

from pyspark.sql.functions import *

df.withColumn("id",when(isnan(col("id")),lit(0)).otherwise(col("id"))).show()
#+---+---+
#| id|num|
#+---+---+
#|0.0|  2|
#+---+---+

df.fillna(0).show()
#+---+---+
#| id|num|
#+---+---+
#|0.0|  2|
#+---+---+

Upvotes: 1

Related Questions