MichiganMagician

Reputation: 273

PySpark: How to derive a new column's value based on another column if any of the rows with a specific id contains null?

Imagine I have a table:

id  Feature
 1  a
 1  b
 1  c
 1  null
 2  a
 2  b
 2  c
 3  a
 3  b
 3  null

The resulting table should be:

id  Feature  Contains null
 1  a        True
 1  b        True
 1  c        True
 1  null     True
 2  a        False
 2  b        False
 2  c        False
 3  a        True
 3  b        True
 3  null     True

Because ids 1 and 3 each have a row with null in the Feature column.
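
For reproducibility, here is a minimal sketch that builds the sample table as a PySpark DataFrame (the variable names spark and df are assumptions, not part of the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the question; Python None becomes null in Spark
df = spark.createDataFrame(
    [(1, "a"), (1, "b"), (1, "c"), (1, None),
     (2, "a"), (2, "b"), (2, "c"),
     (3, "a"), (3, "b"), (3, None)],
    ["id", "Feature"],
)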

Upvotes: 1

Views: 611

Answers (2)

anky

Reputation: 75120

In PySpark, you can use a window function:

from pyspark.sql import functions as F, Window as W

# Partition by id with no ordering, so the frame spans the entire
# partition and max() checks every row of that id for a null Feature
w = W.partitionBy("id")
df.withColumn("Contains_Null", F.max(F.col("Feature").isNull()).over(w)).show()

+---+-------+-------------+
| id|Feature|Contains_Null|
+---+-------+-------------+
|  1|      a|         true|
|  1|      b|         true|
|  1|      c|         true|
|  1|   null|         true|
|  2|      a|        false|
|  2|      b|        false|
|  2|      c|        false|
|  3|      a|         true|
|  3|      b|         true|
|  3|   null|         true|
+---+-------+-------------+
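
If a window feels like overkill, an equivalent approach (a sketch, assuming the same df as above) aggregates the flag once per id and joins it back onto every row:

from pyspark.sql import functions as F

# Compute one null flag per id, then attach it to every row of that id
flags = df.groupBy("id").agg(F.max(F.col("Feature").isNull()).alias("Contains_Null"))
df.join(flags, on="id").show()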

Upvotes: 2

Scott Boston

Reputation: 153510

Since you tagged pandas, it is pretty straightforward:

# For each id group, flag every row if the group contains at least one NaN
df['Contains null'] = df.groupby('id')['Feature'].transform(lambda x: x.isna().any())

Output:

   id Feature  Contains null
0   1       a           True
1   1       b           True
2   1       c           True
3   1     NaN           True
4   2       a          False
5   2       b          False
6   2       c          False
7   3       a           True
8   3       b           True
9   3     NaN           True
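
A lambda-free variant that should give the same result (again assuming the same df) builds the null mask once and broadcasts any() over each id group:

# isna() gives a boolean mask; transform('any') spreads each group's result to its rows
df['Contains null'] = df['Feature'].isna().groupby(df['id']).transform('any')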

Upvotes: 2
