Lilly

Reputation: 988

PySpark: merge two DataFrames to list add-on items per person

I have two DataFrames, shown below. When a person buys something, we can also recommend related add-on products.

df1 lists the items bought by each person. df2 lists the recommended add-on products for each item. For example, "Gopu" buys bun, so I have to recommend "butter" and "jam".

If an item has no added-product entry in df2, it should not appear in the output. For example, "Gopu" buys "biscuit", but df2 has no add-on to recommend for it, so it does not appear in the output table.

A simple left join of df1 and df2 is not working for me.

df1:
name  product
Gopu  biscuit
Gopu  bun
Gopu  ink
Aish  ball
Aish  doll
Aish  bun
Aish  ink
Colin bun
Colin handsanitize
Colin paper

df2:
product added-product 
bun     butter
bun     jam
ink     cloth
ink     bib
paper   pen
doll    barbie

Expected output:

Name    added-product
Gopu    butter
Gopu    jam
Gopu    cloth
Gopu    bib
Aish    barbie
Aish    butter
Aish    jam
Aish    cloth
Aish    bib
Colin   butter
Colin   jam
Colin   pen

Thanks.

Upvotes: 0

Views: 33

Answers (1)

Prathik Kini

Reputation: 1710

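# An inner join on product drops purchases that have no add-on in df2 (a left join would keep them with null)
dfnew = df1.join(df2, on='product', how='inner').select('name', 'added-product').orderBy('name')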

dfnew.show()
+-----+-------------+
| name|added-product|
+-----+-------------+
| Aish|       butter|
| Aish|          jam|
| Aish|        cloth|
| Aish|          bib|
| Aish|       barbie|
|Colin|          jam|
|Colin|          pen|
|Colin|       butter|
| Gopu|       butter|
| Gopu|        cloth|
| Gopu|          jam|
| Gopu|          bib|
+-----+-------------+
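
To see why a left join gives extra rows here, the following is a minimal plain-Python sketch of the two join semantics on the sample data from the question. It does not use Spark, and the `join` helper is a hypothetical illustration, not a PySpark API. It only mimics how unmatched rows are handled.

```python
# Sample data copied from the question.
purchases = [
    ("Gopu", "biscuit"), ("Gopu", "bun"), ("Gopu", "ink"),
    ("Aish", "ball"), ("Aish", "doll"), ("Aish", "bun"), ("Aish", "ink"),
    ("Colin", "bun"), ("Colin", "handsanitize"), ("Colin", "paper"),
]
add_ons = [
    ("bun", "butter"), ("bun", "jam"), ("ink", "cloth"),
    ("ink", "bib"), ("paper", "pen"), ("doll", "barbie"),
]


def join(left, right, how="inner"):
    """Hypothetical helper: match (name, product) rows to (product, add_on) rows."""
    # Group the add-ons by product so each purchase can look up its matches.
    lookup = {}
    for product, add_on in right:
        lookup.setdefault(product, []).append(add_on)

    rows = []
    for name, product in left:
        matches = lookup.get(product, [])
        if matches:
            # One output row per matching add-on.
            rows.extend((name, add_on) for add_on in matches)
        elif how == "left":
            # A left join keeps the unmatched purchase, with no add-on.
            rows.append((name, None))
        # An inner join silently drops the unmatched purchase.
    return rows


inner_rows = join(purchases, add_ons, how="inner")
left_rows = join(purchases, add_ons, how="left")

print(len(inner_rows))  # 12 rows, as in the expected output
print(len(left_rows))   # 15 rows: 3 extra for biscuit, ball and handsanitize
```

The three extra rows in the left-join result are the purchases with no add-on in df2, which is the part the question wants to exclude.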

Upvotes: 1
