pyspark how can I combine column values if columns have a same value

Question

I have an idea to combine dataframe values (a list) in different rows with the same key. The combination must have a same value or more in different rows, so I can not just use df.groupBy('id') to get the result.

Following is the example:

+---------+--------------------+
|id       |num_list            |
+---------+--------------------+
|apple    |[11, 12]            |
|apple    |[11, 13 ,14]        |
|apple    |[10, 22, 25]        |
|banana   |[15, 26]            |
|banana   |[15, 29]            |
|banana   |[15, 27]            |
+---------+--------------------+

we can find id=apple have two record and two lists like: [11, 12],[11, 13, 14], so they would be combined to a new record id=apple,num_list=[11, 12, 13, 14]

But id=apple,num_list=[10, 22, 25] will not be combined.

and this is the answer i want:

+---------+--------------------+
|id       |num_list            |
+---------+--------------------+
|apple    |[11, 12, 13, 14]    |
|apple    |[10, 22, 25]        |
|banana   |[15, 26, 27, 29]    |
+---------+--------------------+

Edit:

There are some rules that I have to explain.

Just as @Usernamenotfound commented,

Assuming apple have three values [11, 12, 14], [9, 13 ,14], and [12,13,27], the answer will be [9, 11, 12, 13 ,14, 27] not [9, 11, 12, 13 ,14] and[12,13,27].

There are some new example:

+--------------------+------------+
|                  id|num_list    |
+--------------------+------------+
|apple               |         [0]|
|apple               |         [0]|
|apple               |         [1]|
|apple               |         [1]|
|apple               |         [2]|
|apple               |         [3]|
|apple               |         [4]|
|apple               |         [5]|
|apple               |         [6]|
|apple               |         [6]|
|apple               |         [7]|
|apple               |      [7, 8]|
|apple               |         [9]|
|apple               |         [9]|
|apple               |         [9]|
|apple               |     [9, 10]|
|apple               | [9, 17, 18]|
|apple               |        [10]|
|apple               |        [10]|
|apple               |        [10]|
+--------------------+------------+

and if i tried the code from @mayank would get wrong answer.

+--------------------------------+---------------------------------------------------------------------------------------+
|                              id|num_list                                                                               |
+--------------------------------+---------------------------------------------------------------------------------------+
|apple                           |[0]                                                                                    |
|apple                           |[0]                                                                                    |
|apple                           |[1]                                                                                    |
|apple                           |[2]                                                                                    |
|apple                           |[3]                                                                                    |
|apple                           |[4]                                                                                    |
|apple                           |[5]                                                                                    |
|apple                           |[6]                                                                                    |
|apple                           |[8, 7]                                                                                 |
|apple                           |[9, 10, 11, 13, 15, 16, 17, 18, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29, 30, 36, 37, 38]|
|apple                           |[12]                                                                                   |
|apple                           |[14]                                                                                   |
|apple                           |[24]                                                                                   |
|apple                           |[31]                                                                                   |
|apple                           |[32]                                                                                   |
|apple                           |[33, 34]                                                                               |
|apple                           |[35]                                                                                   |
|apple                           |[39]                                                                                   |
+--------------------------------+---------------------------------------------------------------------------------------+

Any help would be greatly appreciated.

mayank agrawal · Accepted Answer

Maybe not the most efficient solution, but it solves your problem.

import pyspark.sql.functions as F
from pyspark.sql.types import *

def get_combinations(lis):
   final = []

   for li in lis:
       if [y for y in final if [z for z in li if z in y]]:
           found = [y for y in final if [z for z in li if z in y]][0]
           to_add = list(set(found + li))
           final[final.index(found)] = to_add
       else:
           final.append(li)
   return final

apply_udf = F.udf(lambda x:get_combinations(x),ArrayType(ArrayType(IntegerType())))

df = df.groupby('id').agg(F.collect_list('num_list').alias('num_list'))\
        .select(['id', F.explode(apply_udf('num_list')).alias('num_list')])

pyspark how can I combine column values if columns have a same value

Answers (1)

Related Questions