Charbel
Charbel

Reputation: 73

Edit the value of all the rows of column with the same id based on the value of another column in pyspark

id val1 val2
1 Y Flagged
1 N Flagged
2 N Flagged
2 Y Flagged
2 N Flagged

I have the above table. I want to check the rows in val1 with the same id, if there's at least one Y and one N then all the rows having 1 as id will be marked flagged in val2. In addition, for a more efficient code, I want the code to jump to the next id once it finds a Y.

Upvotes: 1

Views: 141

Answers (2)

anky
anky

Reputation: 75080

You can also use a window to collect the set of values and compare to the array of Y and N values:

from pyspark.sql import functions as F, Window as W

a = F.array([F.lit('N'),F.lit('Y')])
out = (df.withColumn("Flagged",F.array_intersect(a,
                      F.collect_set("val1").over(W.partitionBy("id")))==a))

out.show()

+---+----+-------+-------+
| id|val1|   val2|Flagged|
+---+----+-------+-------+
|  1|   Y|Flagged|   true|
|  1|   N|Flagged|   true|
|  2|   N|Flagged|   true|
|  2|   Y|Flagged|   true|
|  2|   N|Flagged|   true|
+---+----+-------+-------+

Upvotes: 2

Shubham Sharma
Shubham Sharma

Reputation: 71689

Assuming the val1 columns contains only Y and N as unique values, you can group the dataframe by id and aggregate val1 using countDistinct to count the unique values, then create a new column flagged corresponding the condition where distinct count > 1, finally join this new column with original dataframe to get the result

from pyspark.sql import functions as F

counts = df.groupBy('id').agg(F.countDistinct('val1').alias('flagged'))
df = df.join(counts.withColumn('flagged', F.col('flagged') > 1), on='id')

If column val1 may contains other values along with Y, N, then first mask the values which are not in Y and N:

vals = F.when(F.col('val1').isin(['Y', 'N']), F.col('val1'))

counts = df.groupBy('id').agg(F.countDistinct(vals).alias('flagged'))
df = df.join(counts.withColumn('flagged', F.col('flagged') > 1), on='id')

>>> df.show()

| id|val1|flagged|
+---+----+-------+
|  1|   Y|   true|
|  1|   N|   true|
|  2|   N|   true|
|  2|   Y|   true|
|  2|   N|   true|
+---+----+-------+

PS: I have also modified your output slightly as having a column named flagged with boolean values makes more sense

Upvotes: 2

Related Questions