Create a DataFrame showing the other IDs that share the value of one column with each ID

Question

I have the following DataFrame with user_id and label columns. One user can have several labels.

df = spark.createDataFrame(
    [(1, "a"), (2, "b"), (3, "a"), (1, "c"), (4, "b"), (5, "c"), (6, "a"), (7, "e")], ['user_id', 'label']
)

+-------+-----+
|user_id|label|
+-------+-----+
|      1|    a|
|      2|    b|
|      3|    a|
|      1|    c|
|      4|    b|
|      5|    c|
|      6|    a|
|      7|    e|
+-------+-----+

I want to create a new DataFrame with one row per user and an array of all the other users who share at least one label with them:

+-------+-------------+
|user_id|  other_users|
+-------+-------------+
|      1|    [3, 5, 6]|
|      2|          [4]|
|      3|       [1, 6]|
|      4|          [2]|
|      5|          [1]|
|      6|       [1, 3]|
|      7|           []|
+-------+-------------+

What is the best way to achieve this?

AdibP · Accepted Answer

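You can join the DataFrame with itself on label, excluding pairs where both sides have the same user_id, and then aggregate with collect_list. The left join keeps users who share no label with anyone (user 7 here); collect_list skips the resulting nulls, so those users get an empty array.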

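from pyspark.sql.functions import col, collect_list

df = (df
      # Rename the right-hand columns so the self-join has no ambiguous names
      .join(df.selectExpr('user_id ui', 'label lb'),
            # Same label, different user
            [col('label') == col('lb'), col('user_id') != col('ui')],
            # Left join so users with no match are kept
            'left')
      .groupBy('user_id').agg(collect_list('ui').alias('other_users')))
df.show()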

+-------+-----------+
|user_id|other_users|
+-------+-----------+
|      7|         []|
|      6|     [1, 3]|
|      5|        [1]|
|      1|  [5, 3, 6]|
|      3|     [1, 6]|
|      2|        [4]|
|      4|        [2]|
+-------+-----------+
