Reputation: 73
id | val1 | val2 |
---|---|---|
1 | Y | Flagged |
1 | N | Flagged |
2 | N | Flagged |
2 | Y | Flagged |
2 | N | Flagged |
I have the above table. I want to check the rows in val1 with the same id, if there's at least one Y and one N then all the rows having 1 as id will be marked flagged in val2. In addition, for a more efficient code, I want the code to jump to the next id once it finds a Y.
Upvotes: 1
Views: 141
Reputation: 75080
You can also use a window to collect the set of values and compare to the array of Y and N values:
from pyspark.sql import functions as F, Window as W
a = F.array([F.lit('N'),F.lit('Y')])
out = (df.withColumn("Flagged",F.array_intersect(a,
F.collect_set("val1").over(W.partitionBy("id")))==a))
out.show()
+---+----+-------+-------+
| id|val1| val2|Flagged|
+---+----+-------+-------+
| 1| Y|Flagged| true|
| 1| N|Flagged| true|
| 2| N|Flagged| true|
| 2| Y|Flagged| true|
| 2| N|Flagged| true|
+---+----+-------+-------+
Upvotes: 2
Reputation: 71689
Assuming the val1
columns contains only Y
and N
as unique values, you can group the dataframe by id
and aggregate val1
using countDistinct
to count the unique values, then create a new column flagged
corresponding the condition where distinct count > 1, finally join
this new column with original dataframe to get the result
from pyspark.sql import functions as F
counts = df.groupBy('id').agg(F.countDistinct('val1').alias('flagged'))
df = df.join(counts.withColumn('flagged', F.col('flagged') > 1), on='id')
If column val1
may contains other values along with Y, N
, then first mask the values which are not in Y
and N
:
vals = F.when(F.col('val1').isin(['Y', 'N']), F.col('val1'))
counts = df.groupBy('id').agg(F.countDistinct(vals).alias('flagged'))
df = df.join(counts.withColumn('flagged', F.col('flagged') > 1), on='id')
>>> df.show()
| id|val1|flagged|
+---+----+-------+
| 1| Y| true|
| 1| N| true|
| 2| N| true|
| 2| Y| true|
| 2| N| true|
+---+----+-------+
PS: I have also modified your output slightly as having a column named flagged
with boolean values makes more sense
Upvotes: 2