emax
emax

Reputation: 7235

Pyspark: how to join two dataframes over multiple columns?

I have two pyspark dataframes df1 and df2

df1
       id1   id2    id3    x    y
        0     1      2    0.5  0.4
        2     1      0    0.3  0.2
        3     0      2    0.8  0.9 
        2     1      3    0.2  0.1

df2
       id     name
        0      A 
        1      B
        2      C
        3      D

I would like to join the two dataframes and have

df3
       id1   id2    id3    n1   n2   n3   x    y 
        0     1      2     A    B    C   0.5  0.4
        2     1      0     C    B    A   0.3  0.2 
        3     0      2     D    A    C   0.8  0.9
        2     1      3     C    B    D   0.2  0.1

Upvotes: 0

Views: 83

Answers (1)

Lamanus
Lamanus

Reputation: 13551

Here is the multiple joins.

df1.join(df2, df1['id1'] == df2['id'], 'left').drop('id').withColumnRenamed('name', 'n1') \
   .join(df2, df1['id2'] == df2['id'], 'left').drop('id').withColumnRenamed('name', 'n2') \
   .join(df2, df1['id3'] == df2['id'], 'left').drop('id').withColumnRenamed('name', 'n3') \
   .show()

+---+---+---+---+---+---+---+---+
|id1|id2|id3|  x|  y| n1| n2| n3|
+---+---+---+---+---+---+---+---+
|  0|  1|  2|0.5|0.4|  A|  B|  C|
|  2|  1|  0|0.3|0.2|  C|  B|  A|
|  3|  0|  2|0.8|0.9|  D|  A|  C|
|  2|  1|  3|0.2|0.1|  C|  B|  D|
+---+---+---+---+---+---+---+---+

Upvotes: 1

Related Questions