Reputation: 3
I have two dataframes. The data is too large for pandas, so I need to do this in PySpark.
df1:
col 1 | col 2 | col 3 |
12345 | asd | zxc |
12345 | qwe | dfg |
12345 | ert | fgh |
df2:
col 1 | col 2 | col 3 |
54321 | asd | poi |
54321 | qwe | lkj |
54321 | ert | mnb |
54321 | ytuyeye | jfg |
I want to concatenate these dataframes on the same columns, so the width of the table doesn't expand, and only take rows from df2 that have a matching col 2 in df1. In the above example, I would expect the result to have 6 rows.
In other words, df1 is a snapshot and df2 is a snapshot at a later date, and I only want to bring in records from df2 whose col 2 is present in df1. I've tried join and union and haven't had any luck. Thanks in advance.
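For reference, a minimal sketch that builds these example dataframes (assuming a SparkSession named spark is available):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data matching the tables above
df1 = spark.createDataFrame(
    [(12345, "asd", "zxc"), (12345, "qwe", "dfg"), (12345, "ert", "fgh")],
    ["col 1", "col 2", "col 3"])
df2 = spark.createDataFrame(
    [(54321, "asd", "poi"), (54321, "qwe", "lkj"),
     (54321, "ert", "mnb"), (54321, "ytuyeye", "jfg")],
    ["col 1", "col 2", "col 3"])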
I got the answer:
# keep only df2 rows whose col 2 also appears in df1, and only df2's columns
df2_matches = df2.alias('b').join(df1.alias('a'), ['col 2'], "inner")\
    .select("b.*")
# unionByName matches columns by name, so the result keeps df1's schema
df3 = df1.unionByName(df2_matches)
Upvotes: 0
Views: 451
Reputation: 75150
Imports:
from pyspark.sql import functions as F, Window as W, types as T
Method 1, using join and union:
out1 = df1.unionByName(df1.alias("df1")
            .join(df2.alias("df2"), on="col 2", how="inner")
            .select("df2.*"))  # df2.* keeps only df2's columns, so the width doesn't expand
Method 2, using union and then a window to count rows partitioned over col 2, keeping only rows where the count > 1:
w = W.partitionBy("col 2")
out2 = (df1.unionByName(df2)
        .withColumn("Counts", F.count("col 2").over(w))
        .filter("Counts > 1")  # keep rows whose col 2 appears more than once in the union
        .drop("Counts"))
Let's look at the physical plans:
out1.explain()
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Union
:- Scan ExistingRDD[col 1#173L,col 2#174,col 3#175]
+- Project [col 1#192L, col 2#193, col 3#194]
+- SortMergeJoin [col 2#426], [col 2#193], Inner
:- Sort [col 2#426 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(col 2#426, 200), ENSURE_REQUIREMENTS, [id=#1557]
: +- Project [col 2#426]
: +- Filter isnotnull(col 2#426)
: +- Scan ExistingRDD[col 1#425L,col 2#426,col 3#427]
+- Sort [col 2#193 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(col 2#193, 200), ENSURE_REQUIREMENTS, [id=#1558]
+- Filter isnotnull(col 2#193)
+- Scan ExistingRDD[col 1#192L,col 2#193,col 3#194]
out2.explain()
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [col 1#173L, col 2#174, col 3#175]
+- Filter (Counts#451L > 1)
+- Window [count(col 2#174) windowspecdefinition(col 2#174, col 2#174 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS Counts#451L], [col 2#174], [col 2#174 ASC NULLS FIRST]
+- Sort [col 2#174 ASC NULLS FIRST, col 2#174 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(col 2#174, 200), ENSURE_REQUIREMENTS, [id=#1588]
+- Union
:- Scan ExistingRDD[col 1#173L,col 2#174,col 3#175]
+- Scan ExistingRDD[col 1#192L,col 2#193,col 3#194]
Upvotes: 1