Alex Mathew
Alex Mathew

Reputation: 23

Join on items inside an array column in pyspark dataframe

How to obtain df3 from df1 and df2? There is a match if df2.b is in the list of items of df1.b

----------------        --------------          -------------------------------
| a |    b     |        | b  |   c   |          | a |    b    |      c        |       
----------------        --------------     =>   -------------------------------
| 2 | [3,4]    |        | 3  | Three |          | 2 |  [3, 4] | [Three, Four] |
| 3 | [4]      |        | 4  | Four  |          | 3 |  [4]    | [Four]        |
----------------        --------------          -------------------------------
  df1                         df2                            df3

Upvotes: 2

Views: 1971

Answers (1)

blackbishop
blackbishop

Reputation: 32670

Use join with array_contains in condition, then group by a and collect_list on column c:

import pyspark.sql.functions as F

df1 = spark.createDataFrame([(2, [3, 4]), (3, [4])], ["a", "b"])
df2 = spark.createDataFrame([(3, "Three"), (4, "Four")], ["b", "c"])

df3 = df1.alias("df1").join(
    df2.alias("df2"),
    F.expr("array_contains(df1.b, df2.b)"),
    "left"
).groupBy("df1.a").agg(
    F.first("df1.b").alias("b"),
    F.collect_list("df2.c").alias("c")
)

df3.show()
#+---+------+-------------+
#|  a|     b|            c|
#+---+------+-------------+
#|  2|[3, 4]|[Three, Four]|
#|  3|   [4]|       [Four]|
#+---+------+-------------+

Upvotes: 2

Related Questions