How to replace null values in the output of a left join operation with 0 in pyspark dataframe?

Question

I have a simple PySpark dataframe, df1-

df1 = spark.createDataFrame([
    ("u1", 1),
    ("u1", 2),
    ("u2", 3),
    ("u3", 4),

    ],
    ['user_id', 'var1'])

print(df1.printSchema())
df1.show(truncate=False)

Output-

root
 |-- user_id: string (nullable = true)
 |-- var1: long (nullable = true)

None
+-------+----+
|user_id|var1|
+-------+----+
|u1     |1   |
|u1     |2   |
|u2     |3   |
|u3     |4   |
+-------+----+

I have another PySpark dataframe df2-

df2 = spark.createDataFrame([
    (1, 'f1'),
    (2, 'f2'),

    ],
    ['var1', 'var2'])

print(df2.printSchema())
df2.show(truncate=False)

Output-

root
 |-- var1: long (nullable = true)
 |-- var2: string (nullable = true)

None
+----+----+
|var1|var2|
+----+----+
|1   |f1  |
|2   |f2  |
+----+----+

I have to join the two dataframes mentioned above, by using a left-join operation on them-

df1.join(df2, df1.var1==df2.var1, 'left').show()

Output-

+-------+----+----+----+
|user_id|var1|var1|var2|
+-------+----+----+----+
|     u1|   1|   1|  f1|
|     u1|   2|   2|  f2|
|     u2|   3|null|null|
|     u3|   4|null|null|
+-------+----+----+----+

But as you can see, I am getting null values in the rows for which there two tables don't have a match. How can I replace all the null values with 0?

mck · Accepted Answer

You can use fillna. Two fillnas are needed to account for integer and string columns.

df1.join(df2, df1.var1==df2.var1, 'left').fillna(0).fillna("0")

How to replace null values in the output of a left join operation with 0 in pyspark dataframe?

Answers (2)

Related Questions