MichiganMagician

Reputation: 273

PySpark: How to derive a new column's value based on another column if any of the rows with a specific id contains null?

Imagine I have a table:

id  Feature
 1  a
 1  b
 1  c
 1  null
 2  a
 2  b
 2  c
 3  a
 3  b
 3  null

The resulting table should be:

id  Feature  Contains null
 1  a        True
 1  b        True
 1  c        True
 1  null     True
 2  a        False
 2  b        False
 2  c        False
 3  a        True
 3  b        True
 3  null     True

Because ids 1 and 3 each have a row with null in the Feature column.
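
For reproducibility, here is a minimal sketch that builds the sample table as a PySpark DataFrame (the variable names spark and df are assumptions, not part of the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the question; Python None becomes null in Spark
df = spark.createDataFrame(
    [(1, "a"), (1, "b"), (1, "c"), (1, None),
     (2, "a"), (2, "b"), (2, "c"),
     (3, "a"), (3, "b"), (3, None)],
    ["id", "Feature"],
)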

Upvotes: 1

Views: 611

Answers (2)

anky

Reputation: 75120

In PySpark, you can use a window function:

from pyspark.sql import functions as F, Window as W

# Partition by id with no ordering, so the frame spans the entire
# partition and max() checks every row of that id for a null Feature
w = W.partitionBy("id")
df.withColumn("Contains_Null", F.max(F.col("Feature").isNull()).over(w)).show()

+---+-------+-------------+
| id|Feature|Contains_Null|
+---+-------+-------------+
|  1|      a|         true|
|  1|      b|         true|
|  1|      c|         true|
|  1|   null|         true|
|  2|      a|        false|
|  2|      b|        false|
|  2|      c|        false|
|  3|      a|         true|
|  3|      b|         true|
|  3|   null|         true|
+---+-------+-------------+
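
If a window feels like overkill, an equivalent approach (a sketch, assuming the same df as above) aggregates the flag once per id and joins it back onto every row:

from pyspark.sql import functions as F

# Compute one null flag per id, then attach it to every row of that id
flags = df.groupBy("id").agg(F.max(F.col("Feature").isNull()).alias("Contains_Null"))
df.join(flags, on="id").show()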

Upvotes: 2

Scott Boston

Reputation: 153510

Since you tagged pandas, it is pretty straightforward:

# For each id group, flag every row if the group contains at least one NaN
df['Contains null'] = df.groupby('id')['Feature'].transform(lambda x: x.isna().any())

Output:

   id Feature  Contains null
0   1       a           True
1   1       b           True
2   1       c           True
3   1     NaN           True
4   2       a          False
5   2       b          False
6   2       c          False
7   3       a           True
8   3       b           True
9   3     NaN           True
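
A lambda-free variant that should give the same result (again assuming the same df) builds the null mask once and broadcasts any() over each id group:

# isna() gives a boolean mask; transform('any') spreads each group's result to its rows
df['Contains null'] = df['Feature'].isna().groupby(df['id']).transform('any')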

Upvotes: 2
