Reputation: 3534
I want to replace null values in one column with the values from an adjacent column. For example, if I have
A | B
0 | 1
2 | null
3 | null
4 | 2
I want it to be:
A | B
0 | 1
2 | 2
3 | 3
4 | 2
I tried
df.na.fill(df.A, "B")
but it didn't work; the error says the value should be a float, int, long, string, or dict.
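From what I can tell, na.fill only accepts literal values, e.g.:
df.na.fill({"B": 0})  # fills nulls in B with a constant, not with another column
so passing a column like df.A doesn't fit its signature.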
Any ideas?
Upvotes: 37
Views: 66113
Reputation: 15058
Note: coalesce will not replace NaN values, only nulls:
>>> import pyspark.sql.functions as F
>>> cDf = spark.createDataFrame([(None, None), (1, None), (None, 2)], ("a", "b"))
>>> cDf.show()
+----+----+
| a| b|
+----+----+
|null|null|
| 1|null|
|null| 2|
+----+----+
>>> cDf.select(F.coalesce(cDf["a"], cDf["b"])).show()
+--------------+
|coalesce(a, b)|
+--------------+
| null|
| 1|
| 2|
+--------------+
Let's now create a pandas.DataFrame with None entries, convert it into a Spark DataFrame, and use coalesce again:
>>> import pandas as pd
>>> cDf_from_pd = spark.createDataFrame(pd.DataFrame({'a': [None, 1, None], 'b': [None, None, 2]}))
>>> cDf_from_pd.show()
+---+---+
| a| b|
+---+---+
|NaN|NaN|
|1.0|NaN|
|NaN|2.0|
+---+---+
>>> cDf_from_pd.select(F.coalesce(cDf_from_pd["a"], cDf_from_pd["b"])).show()
+--------------+
|coalesce(a, b)|
+--------------+
| NaN|
| 1.0|
| NaN|
+--------------+
In which case you'll need to first call replace on your DataFrame to convert NaNs to nulls.
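A minimal sketch of that conversion, assuming a Spark version whose replace accepts None as the replacement value:
>>> cDf_no_nan = cDf_from_pd.replace(float("nan"), None)  # NaN -> null
>>> cDf_no_nan.select(F.coalesce(cDf_no_nan["a"], cDf_no_nan["b"])).show()  # now yields null, 1.0, 2.0
(F.nanvl(col1, col2), which returns col2 where col1 is NaN, is another option for floating point columns.)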
Upvotes: 0
Reputation: 3534
We can use coalesce:
from pyspark.sql.functions import coalesce
df.withColumn("B", coalesce(df.B, df.A))
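Applied to the question's sample data (a quick sketch; this assumes a SparkSession named spark, and the df built here is just for illustration):
from pyspark.sql.functions import coalesce

df = spark.createDataFrame([(0, 1), (2, None), (3, None), (4, 2)], ["A", "B"])
# coalesce returns its first non-null argument, so B falls back to A
df.withColumn("B", coalesce(df.B, df.A)).show()
# +---+---+
# |  A|  B|
# +---+---+
# |  0|  1|
# |  2|  2|
# |  3|  3|
# |  4|  2|
# +---+---+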
Upvotes: 74
Reputation: 1881
Another answer: if df1 below is your dataframe,
rd1 = sc.parallelize([(0, 1), (2, None), (3, None), (4, 2)])
df1 = rd1.toDF(['A', 'B'])

from pyspark.sql.functions import when

# take A where B is null, otherwise keep B
df1.select('A',
           when(df1.B.isNull(), df1.A).otherwise(df1.B).alias('B')
          ).show()
Upvotes: 17
Reputation: 3619
from pyspark.sql import Row

# keep the row if B is not null; otherwise rebuild it with A's value in B
df.rdd.map(lambda row: row if row[1] is not None else Row(A=row[0], B=row[0])).toDF().show()
Upvotes: 3