Reputation: 369
I have the following two dataframes
df1
+--------+-----------------------------
|id | amount | fee |
|1 | 10.00 | 5.0 |
|3 | 90 | 130.0 |
df2
+--------+--------------------------------
|exId | exAmount | exFee |
|1 | 10.00 | 5.0 |
|1 | 10.0 | 5.0 |
|3 | 90.0 | 130.0 |
I am joining between them using all three columns and trying to identify columns which are common between the two dataframes and the ones which are not.
I'm looking for output:
+--------+--------------------------------------------
|id | amount | fee |exId | exAmount | exFee |
|1 | 10.00 | 5.0 |1 | 10.0 | 5.0 |
|null| null | null |1 | 10.0 | 5.0 |
|3 | 90 | 130.0|3 | 90.0 | 130.0 |
Basically want the duplicate row in df2 with exId 1 to be listed separately. Any thoughts?
Upvotes: 0
Views: 4600
Reputation: 41957
One of the possible way is to group by all three columns and generate row numbers for each dataframe and use that additional column in addition to the rest three columns while joining. You should get what you desire.
import org.apache.spark.sql.expressions._
def windowSpec1 = Window.partitionBy("id", "amount", "fee").orderBy("fee")
def windowSpec2 = Window.partitionBy("exId", "exAmount", "exFee").orderBy("exFee")
import org.apache.spark.sql.functions._
df1.withColumn("sno", row_number().over(windowSpec1)).join(
df2.withColumn("exSno", row_number().over(windowSpec2)),
col("id") === col("exId") && col("amount") === col("exAmount") && col("fee") === col("exFee") && col("sno") === col("exSno"), "outer")
.drop("sno", "exSno")
.show(false)
and you should be getting
+----+------+-----+----+--------+-----+
|id |amount|fee |exId|exAmount|exFee|
+----+------+-----+----+--------+-----+
|null|null |null |1 |10.0 |5.0 |
|3 |90 |130.0|3 |90 |130.0|
|1 |10.00 |5.0 |1 |10.00 |5.0 |
+----+------+-----+----+--------+-----+
I hope the answer is helpful
Upvotes: 1