Reputation: 69
I have a dataframe as below:
ID match set
1 [1,2]
2 [1,2,3]
3 [2,3]
4 [4]
I want to compare between rows and add the missing value if any one of the value in list is present in another list
example - in first row since the values [1,2] is present in second row [1,2,3], get the missing value and add to list in first row
output
ID match set
1 [1,2,3]
2 [1,2,3]
3 [1,2,3]
4 [4]
Upvotes: 2
Views: 328
Reputation: 75120
One approach is explode the array to create a helper column and self join the dataframe using that column, then group the IDs and collect back as a distinct array:
from pyspark.sql import functions as F, Window as W
a = df.select("*",F.explode("match_set").alias("V"))
agg_logic = F.array_sort(F.array_distinct(F.flatten(F.collect_set("match_set"))))
(a.drop("ID").join(a.drop("match_set"),on='V')
.groupby("ID").agg(agg_logic.alias("match_set"))).show()
+---+---------+
| ID|match_set|
+---+---------+
| 1|[1, 2, 3]|
| 2|[1, 2, 3]|
| 3|[1, 2, 3]|
| 4| [4]|
+---+---------+
Upvotes: 1