Reputation: 43
Is there any way to replace NaN with 0 in PySpark using the df.withColumn function, similar to using fillna in pandas?
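For reference, this is the pandas behaviour I mean (a minimal sketch; pdf is just an illustrative name):
import numpy as np
import pandas as pd

# pandas fillna replaces NaN with the given value
pdf = pd.DataFrame({"id": [np.nan, 1.0], "num": [2, 3]})
pdf = pdf.fillna(0)  # NaN in 'id' becomes 0.0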
Upvotes: 3
Views: 7303
Reputation: 5052
fillna is natively available in PySpark (DataFrame.fillna is an alias for na.fill).
Apart from that, you can handle missing values with a combination of isNull and when:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

input_str = """
|1 |704 |STANDARD|null |PR |30100
|2 |null |null |PASEO COSTA DEL SUR|PR |null
|3 |709 |null |BDA SAN LUIS |PR |3700
|4 |null |UNIQUE |CINGULAR WIRELESS |TX |84000
|5 |76177 |STANDARD|null |TX |null
""".split("|")

# Strip whitespace and convert the literal string 'null' to None
input_values = list(map(lambda x: x.strip() if x.strip() != 'null' else None, input_str[1:]))
n = len(input_values)

cols = ['id', 'zipcode', 'type', 'city', 'state', 'population']
input_list = [tuple(input_values[i:i + 6]) for i in range(0, n, 6)]

sparkDF = spark.createDataFrame(input_list, cols)
sparkDF.show()
+---+-------+--------+-------------------+-----+----------+
| id|zipcode|    type|               city|state|population|
+---+-------+--------+-------------------+-----+----------+
|  1|    704|STANDARD|               null|   PR|     30100|
|  2|   null|    null|PASEO COSTA DEL SUR|   PR|      null|
|  3|    709|    null|       BDA SAN LUIS|   PR|      3700|
|  4|   null|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|
|  5|  76177|STANDARD|               null|   TX|      null|
+---+-------+--------+-------------------+-----+----------+
# Cast the numeric columns to double so that a numeric fill value applies to them
sparkDF = sparkDF.withColumn('zipcode', F.col('zipcode').cast(DoubleType())) \
                 .withColumn('population', F.col('population').cast(DoubleType()))

# Fill every null in the numeric columns with 0; the string columns are
# left untouched because the fill value is numeric
sparkDF.na.fill(0).show()
+---+-------+--------+-------------------+-----+----------+
| id|zipcode|    type|               city|state|population|
+---+-------+--------+-------------------+-----+----------+
|  1|  704.0|STANDARD|               null|   PR|   30100.0|
|  2|    0.0|    null|PASEO COSTA DEL SUR|   PR|       0.0|
|  3|  709.0|    null|       BDA SAN LUIS|   PR|    3700.0|
|  4|    0.0|  UNIQUE|  CINGULAR WIRELESS|   TX|   84000.0|
|  5|76177.0|STANDARD|               null|   TX|       0.0|
+---+-------+--------+-------------------+-----+----------+
# Restrict the fill to a subset of columns
sparkDF.na.fill(0, subset=["population"]).show()
+---+-------+--------+-------------------+-----+----------+
| id|zipcode|    type|               city|state|population|
+---+-------+--------+-------------------+-----+----------+
|  1|  704.0|STANDARD|               null|   PR|   30100.0|
|  2|   null|    null|PASEO COSTA DEL SUR|   PR|       0.0|
|  3|  709.0|    null|       BDA SAN LUIS|   PR|    3700.0|
|  4|   null|  UNIQUE|  CINGULAR WIRELESS|   TX|   84000.0|
|  5|76177.0|STANDARD|               null|   TX|       0.0|
+---+-------+--------+-------------------+-----+----------+
# Per-column alternative: when + otherwise with isNull
sparkDF.withColumn('type', F.when(F.col('type').isNull(), 'Not Available').otherwise(F.col('type'))).show()
+---+-------+-------------+-------------------+-----+----------+
| id|zipcode|         type|               city|state|population|
+---+-------+-------------+-------------------+-----+----------+
|  1|  704.0|     STANDARD|               null|   PR|   30100.0|
|  2|   null|Not Available|PASEO COSTA DEL SUR|   PR|      null|
|  3|  709.0|Not Available|       BDA SAN LUIS|   PR|    3700.0|
|  4|   null|       UNIQUE|  CINGULAR WIRELESS|   TX|   84000.0|
|  5|76177.0|     STANDARD|               null|   TX|      null|
+---+-------+-------------+-------------------+-----+----------+
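Note that na.fill (and its alias fillna) replaces NaN as well as null in numeric columns, so the same call covers the NaN case from the question. A minimal sketch (nan_df is just an illustrative name):

# NaN can only occur in floating-point columns
nan_df = spark.createDataFrame([(1, float("nan")), (2, 3.5)], ["id", "score"])

# Both null and NaN in numeric columns are replaced by the fill value
nan_df.na.fill(0).show()
#+---+-----+
#| id|score|
#+---+-----+
#|  1|  0.0|
#|  2|  3.5|
#+---+-----+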
Upvotes: 2
Reputation: 31520
Use a when + otherwise expression with the isnan function.
Example:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, isnan, lit, when

spark = SparkSession.builder.getOrCreate()

# NaN requires a floating-point column
df = spark.createDataFrame([(float("nan"), 2)], ["id", "num"])
df.show()
#+---+---+
#| id|num|
#+---+---+
#|NaN|  2|
#+---+---+
df.withColumn("id", when(isnan(col("id")), lit(0)).otherwise(col("id"))).show()
#+---+---+
#| id|num|
#+---+---+
#|0.0|  2|
#+---+---+
# fillna / na.fill also replaces NaN (not just null) in numeric columns
df.fillna(0).show()
#+---+---+
#| id|num|
#+---+---+
#|0.0|  2|
#+---+---+
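As a side note (an alternative, not what the code above uses): if you prefer a single expression instead of the when/otherwise chain, the built-in nanvl function does the same NaN replacement:

from pyspark.sql.functions import col, lit, nanvl

# nanvl(a, b) returns a unless a is NaN, in which case it returns b;
# both arguments must be floating-point expressions
df.withColumn("id", nanvl(col("id"), lit(0.0))).show()
#+---+---+
#| id|num|
#+---+---+
#|0.0|  2|
#+---+---+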
Upvotes: 1