Reputation: 153
Suppose you have a Spark dataframe containing some null values, and you would like to replace the values of one column with the values from another if present. In Python/Pandas you can use the fillna() function to do this quite nicely:
df = spark.createDataFrame([('a', 'b', 'c'),(None,'e', 'f'),(None,None,'i')], ['c1','c2','c3'])
DF = df.toPandas()
DF['c1'].fillna(DF['c2']).fillna(DF['c3'])
How can this be done using Pyspark?
Upvotes: 4
Views: 6508
Reputation: 40360
You need to use the coalesce function :
cDf = spark.createDataFrame([(None, None), (1, None), (None, 2)], ("a", "b"))
cDF.show()
# +----+----+
# | a| b|
# +----+----+
# |null|null|
# | 1|null|
# |null| 2|
# +----+----+
cDf.select(coalesce(cDf["a"], cDf["b"])).show()
# +--------------+
# |coalesce(a, b)|
# +--------------+
# | null|
# | 1|
# | 2|
# +--------------+
cDf.select('*', coalesce(cDf["a"], lit(0.0))).show()
# +----+----+----------------+
# | a| b|coalesce(a, 0.0)|
# +----+----+----------------+
# |null|null| 0.0|
# | 1|null| 1.0|
# |null| 2| 0.0|
# +----+----+----------------+
You can also apply coalesce
on multiple columns :
cDf.select(coalesce(cDf["a"], cDf["b"], lit(0))).show()
# ...
This example is taken from the pyspark.sql API documentation.
Upvotes: 11