Nick01
Nick01

Reputation: 369

joining two dataframes having duplicate row

I have the following two dataframes

df1

+--------+-----------------------------
|id  | amount       | fee             | 
|1   | 10.00        | 5.0             |
|3   | 90           | 130.0           |

df2

+--------+--------------------------------
|exId  | exAmount     | exFee            | 
|1     | 10.00        | 5.0              |
|1     | 10.0         | 5.0              |
|3     | 90.0         | 130.0              |

I am joining between them using all three columns and trying to identify columns which are common between the two dataframes and the ones which are not.

I'm looking for output:

+--------+--------------------------------------------
|id  | amount       | fee  |exId  | exAmount | exFee | 
|1   | 10.00        | 5.0  |1     | 10.0     | 5.0   |
|null| null         | null |1     | 10.0     | 5.0   |
|3   | 90           | 130.0|3     | 90.0     | 130.0 |

Basically want the duplicate row in df2 with exId 1 to be listed separately. Any thoughts?

Upvotes: 0

Views: 4600

Answers (1)

Ramesh Maharjan
Ramesh Maharjan

Reputation: 41957

One of the possible way is to group by all three columns and generate row numbers for each dataframe and use that additional column in addition to the rest three columns while joining. You should get what you desire.

import org.apache.spark.sql.expressions._
def windowSpec1 = Window.partitionBy("id", "amount", "fee").orderBy("fee")
def windowSpec2 = Window.partitionBy("exId", "exAmount", "exFee").orderBy("exFee")

import org.apache.spark.sql.functions._
df1.withColumn("sno", row_number().over(windowSpec1)).join(
  df2.withColumn("exSno", row_number().over(windowSpec2)),
  col("id") === col("exId") && col("amount") === col("exAmount") && col("fee") === col("exFee") && col("sno") === col("exSno"), "outer")
  .drop("sno", "exSno")
  .show(false)

and you should be getting

+----+------+-----+----+--------+-----+
|id  |amount|fee  |exId|exAmount|exFee|
+----+------+-----+----+--------+-----+
|null|null  |null |1   |10.0    |5.0  |
|3   |90    |130.0|3   |90      |130.0|
|1   |10.00 |5.0  |1   |10.00   |5.0  |
+----+------+-----+----+--------+-----+

I hope the answer is helpful

Upvotes: 1

Related Questions