How do I obtain df3 from df1 and df2? There is a match when df2.b appears in the list of items in df1.b:
--------------   -------------      ------------------------------
| a | b      |   | b | c     |      | a | b      | c             |
--------------   -------------  =>  ------------------------------
| 2 | [3, 4] |   | 3 | Three |      | 2 | [3, 4] | [Three, Four] |
| 3 | [4]    |   | 4 | Four  |      | 3 | [4]    | [Four]        |
--------------   -------------      ------------------------------
     df1              df2                        df3
Use a join with array_contains in the join condition, then group by a and use collect_list on column c:
import pyspark.sql.functions as F

df1 = spark.createDataFrame([(2, [3, 4]), (3, [4])], ["a", "b"])
df2 = spark.createDataFrame([(3, "Three"), (4, "Four")], ["b", "c"])

# Left-join on array_contains(df1.b, df2.b), then group back to one row per
# df1.a, keeping the original array and collecting the matched labels from df2.c.
df3 = df1.alias("df1").join(
    df2.alias("df2"),
    F.expr("array_contains(df1.b, df2.b)"),
    "left"
).groupBy("df1.a").agg(
    F.first("df1.b").alias("b"),
    F.collect_list("df2.c").alias("c")
)
df3.show()
#+---+------+-------------+
#| a| b| c|
#+---+------+-------------+
#| 2|[3, 4]|[Three, Four]|
#| 3| [4]| [Four]|
#+---+------+-------------+
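For reference, here is an alternative sketch (not part of the original answer) that avoids the non-equi array_contains condition by exploding df1.b and joining on equality. Two caveats: explode drops rows whose array is empty, and collect_list does not guarantee element order.

from pyspark.sql import functions as F

# Alternative sketch, assuming the same df1/df2 as above:
# explode the array so each element gets its own row, equi-join on that
# element, then group back to one row per value of a.
df3_alt = (
    df1.withColumn("b_item", F.explode("b"))                       # one row per element of b
       .join(df2.withColumnRenamed("b", "b_item"), "b_item", "left")
       .groupBy("a")
       .agg(
           F.first("b").alias("b"),                                # keep the original array
           F.collect_list("c").alias("c")                          # gather matched labels
       )
)
df3_alt.show()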