Reputation: 302
I have an idea to combine dataframe values (a list) in different rows with the same key.
The combination must have a same value or more in different rows, so I can not just use df.groupBy('id')
to get the result.
Following is the example:
+---------+--------------------+
|id |num_list |
+---------+--------------------+
|apple |[11, 12] |
|apple |[11, 13 ,14] |
|apple |[10, 22, 25] |
|banana |[15, 26] |
|banana |[15, 29] |
|banana |[15, 27] |
+---------+--------------------+
we can find id=apple
have two record and two lists like: [11, 12],[11, 13, 14]
, so they would be combined to a new record id=apple,num_list=[11, 12, 13, 14]
But id=apple,num_list=[10, 22, 25]
will not be combined.
and this is the answer i want:
+---------+--------------------+
|id |num_list |
+---------+--------------------+
|apple |[11, 12, 13, 14] |
|apple |[10, 22, 25] |
|banana |[15, 26, 27, 29] |
+---------+--------------------+
Edit:
There are some rules that I have to explain.
Just as @Usernamenotfound commented,
Assuming apple have three values [11, 12, 14]
, [9, 13 ,14]
, and [12,13,27]
, the answer will be [9, 11, 12, 13 ,14, 27]
not [9, 11, 12, 13 ,14]
and[12,13,27]
.
There are some new example:
+--------------------+------------+
| id|num_list |
+--------------------+------------+
|apple | [0]|
|apple | [0]|
|apple | [1]|
|apple | [1]|
|apple | [2]|
|apple | [3]|
|apple | [4]|
|apple | [5]|
|apple | [6]|
|apple | [6]|
|apple | [7]|
|apple | [7, 8]|
|apple | [9]|
|apple | [9]|
|apple | [9]|
|apple | [9, 10]|
|apple | [9, 17, 18]|
|apple | [10]|
|apple | [10]|
|apple | [10]|
+--------------------+------------+
and if i tried the code from @mayank would get wrong answer.
+--------------------------------+---------------------------------------------------------------------------------------+
| id|num_list |
+--------------------------------+---------------------------------------------------------------------------------------+
|apple |[0] |
|apple |[0] |
|apple |[1] |
|apple |[2] |
|apple |[3] |
|apple |[4] |
|apple |[5] |
|apple |[6] |
|apple |[8, 7] |
|apple |[9, 10, 11, 13, 15, 16, 17, 18, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29, 30, 36, 37, 38]|
|apple |[12] |
|apple |[14] |
|apple |[24] |
|apple |[31] |
|apple |[32] |
|apple |[33, 34] |
|apple |[35] |
|apple |[39] |
+--------------------------------+---------------------------------------------------------------------------------------+
Any help would be greatly appreciated.
Upvotes: 1
Views: 2289
Reputation: 2545
Maybe not the most efficient solution, but it solves your problem.
import pyspark.sql.functions as F
from pyspark.sql.types import *
def get_combinations(lis):
final = []
for li in lis:
if [y for y in final if [z for z in li if z in y]]:
found = [y for y in final if [z for z in li if z in y]][0]
to_add = list(set(found + li))
final[final.index(found)] = to_add
else:
final.append(li)
return final
apply_udf = F.udf(lambda x:get_combinations(x),ArrayType(ArrayType(IntegerType())))
df = df.groupby('id').agg(F.collect_list('num_list').alias('num_list'))\
.select(['id', F.explode(apply_udf('num_list')).alias('num_list')])
Upvotes: 1