Pyspark append missing values of array by comparing with array of another row

Question

I have a dataframe as below:

ID       match set
1         [1,2]
2        [1,2,3]
3         [2,3]
4          [4]

I want to compare between rows and add the missing value if any one of the value in list is present in another list

example - in first row since the values [1,2] is present in second row [1,2,3], get the missing value and add to list in first row

output

ID       match set
1        [1,2,3]
2        [1,2,3]
3        [1,2,3]
4          [4]

anky · Accepted Answer

One approach is explode the array to create a helper column and self join the dataframe using that column, then group the IDs and collect back as a distinct array:

from pyspark.sql import functions as F, Window as W

a = df.select("*",F.explode("match_set").alias("V"))
agg_logic = F.array_sort(F.array_distinct(F.flatten(F.collect_set("match_set"))))
(a.drop("ID").join(a.drop("match_set"),on='V')
             .groupby("ID").agg(agg_logic.alias("match_set"))).show()

+---+---------+
| ID|match_set|
+---+---------+
|  1|[1, 2, 3]|
|  2|[1, 2, 3]|
|  3|[1, 2, 3]|
|  4|      [4]|
+---+---------+

Pyspark append missing values of array by comparing with array of another row

Answers (1)

Related Questions