Reputation: 273
Imagine I have a table:
id | Feature |
---|---|
1 | a |
1 | b |
1 | c |
1 | null |
2 | a |
2 | b |
2 | c |
3 | a |
3 | b |
3 | null |
The resulting table should be:
id | Feature | Contains null |
---|---|---|
1 | a | True |
1 | b | True |
1 | c | True |
1 | null | True |
2 | a | False |
2 | b | False |
2 | c | False |
3 | a | True |
3 | b | True |
3 | null | True |
Because ids 1 and 3 each have a row where the Feature column is null.
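For anyone reproducing this, here is a minimal sketch of the sample data as a pandas DataFrame (pandas is assumed; the name df is arbitrary):
import pandas as pd

# sample data from the table above; None marks the null entries
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Feature": ["a", "b", "c", None, "a", "b", "c", "a", "b", None],
})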
Upvotes: 1
Views: 611
Reputation: 75120
In PySpark, you need a window function over each id partition:
from pyspark.sql import functions as F, Window as W

# window over the whole id group; max of the isNull flag is True
# if any row in the group is null (no ordering is needed for max)
w = W.partitionBy("id")
df.withColumn("Contains_Null", F.max(F.col("Feature").isNull()).over(w)).show()
+---+-------+-------------+
| id|Feature|Contains_Null|
+---+-------+-------------+
| 1| a| true|
| 1| b| true|
| 1| c| true|
| 1| null| true|
| 2| a| false|
| 2| b| false|
| 2| c| false|
| 3| a| true|
| 3| b| true|
| 3| null| true|
+---+-------+-------------+
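If you would rather avoid a window, a groupBy/join sketch gives the same result (has_null is just an illustrative name):
# aggregate the null flag once per id, then join it back onto every row
has_null = df.groupBy("id").agg(
    F.max(F.col("Feature").isNull()).alias("Contains_Null")
)
df.join(has_null, on="id").show()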
Upvotes: 2
Reputation: 153510
Since you tagged pandas, it is pretty straightforward:
# per id, flag whether any Feature value is missing
df['Contains null'] = df.groupby('id')['Feature'].transform(lambda x: x.isna().any())
Output:
id Feature Contains null
0 1 a True
1 1 b True
2 1 c True
3 1 NaN True
4 2 a False
5 2 b False
6 2 c False
7 3 a True
8 3 b True
9 3 NaN True
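The same flag can be computed without the lambda, which is usually faster on large frames (a sketch using the same column names as above):
# build the null mask once, group it by id, and broadcast 'any' back per row
df['Contains null'] = df['Feature'].isna().groupby(df['id']).transform('any')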
Upvotes: 2