Reputation: 642
I have two Spark DataFrames which I need to join. I only want to select the values that are present in both df1 and df2, and there shouldn't be repeated rows.
For example:
df1:
+-------------+-------------+----------+
|a            |b            |val       |
+-------------+-------------+----------+
| 202003101750| 202003101700|1712384842|
| 202003101740| 202003101700|1590554927|
| 202003101730| 202003101700|1930860788|
| 202003101730| 202003101600|    101713|
| 202003101720| 202003101700|1261542412|
| 202003101720| 202003101600|   1824155|
| 202003101710| 202003101700| 912601761|
+-------------+-------------+----------+
df2:
+-------------+-------------+
|a            |b            |
+-------------+-------------+
| 202003101800| 202003101700|
| 202003101800| 202003101700|
| 202003101750| 202003101700|
| 202003101750| 202003101700|
| 202003101750| 202003101700|
| 202003101750| 202003101700|
| 202003101740| 202003101700|
| 202003101740| 202003101700|
+-------------+-------------+
I am doing the following:
df1.join(df2, Seq("a", "b"), "leftouter").where(col("val").isNotNull)
But my output has several repeated rows.
+-------------+-------------+----------+
|a            |b            |val       |
+-------------+-------------+----------+
| 202003101750| 202003101700|1712384842|
| 202003101750| 202003101700|1712384842|
| 202003101750| 202003101700|1712384842|
| 202003101750| 202003101700|1712384842|
| 202003101740| 202003101700|1590554927|
| 202003101740| 202003101700|1590554927|
| 202003101740| 202003101700|1590554927|
| 202003101740| 202003101700|1590554927|
+-------------+-------------+----------+
I am trying to achieve an except-like operation when val is dropped from df1, but except doesn't seem to work.
For example, the following is the desired operation:
df1.drop(col("val")).except(df2)
The schema for df1 is as follows:
root
 |-- a: string (nullable = true)
 |-- b: string (nullable = true)
 |-- val: long (nullable = true)
Also, what exactly is the difference between a left-outer join and except?
Expected output:
+-------------+-------------+----------+
|a            |b            |val       |
+-------------+-------------+----------+
| 202003101750| 202003101700|1712384842|
| 202003101740| 202003101700|1590554927|
+-------------+-------------+----------+
Upvotes: 3
Views: 4072
Reputation: 555
You can use the function dropDuplicates(), which removes all duplicated rows:
val uniqueDF = df.dropDuplicates()
Or you can specify the columns you want to match on:
val uniqueDF = df.dropDuplicates("a", "b")
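Applied to the join from the question, a minimal sketch (assuming the df1 and df2 defined above):
val result = df1.join(df2, Seq("a", "b"), "inner") // keep only (a, b) pairs present in both sides
  .dropDuplicates("a", "b")                        // collapse the matches repeated by df2
result.show()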
Upvotes: 2
Reputation: 31460
A leftouter join gets all the rows from the left table and the matching rows from the right table.
except gives the rows that exist in the first dataframe but not in the second (without duplicates).
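For example, a quick sketch of except on the question's key columns (row order may vary):
df1.select("a", "b").except(df2).show()
//+------------+------------+
//|           a|           b|
//+------------+------------+
//|202003101730|202003101700|
//|202003101730|202003101600|
//|202003101720|202003101700|
//|202003101720|202003101600|
//|202003101710|202003101700|
//+------------+------------+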
For your case you can use an inner (or) outer join with dropDuplicates.
df1.join(df2, Seq("a", "b"), "inner").dropDuplicates().show()
//+------------+------------+----------+
//|           a|           b|       val|
//+------------+------------+----------+
//|202003101740|202003101700|1590554927|
//|202003101750|202003101700|1712384842|
//+------------+------------+----------+
df1.join(df2, Seq("a", "b"), "rightouter").where(col("val").isNotNull).dropDuplicates().show()
//+------------+------------+----------+
//|           a|           b|       val|
//+------------+------------+----------+
//|202003101740|202003101700|1590554927|
//|202003101750|202003101700|1712384842|
//+------------+------------+----------+
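Alternatively, if the goal is just to keep the df1 rows whose (a, b) exists in df2, a left_semi join does the existence check in one step and never duplicates df1 rows (a sketch, equivalent on this data):
df1.join(df2, Seq("a", "b"), "left_semi").show()
//+------------+------------+----------+
//|           a|           b|       val|
//+------------+------------+----------+
//|202003101750|202003101700|1712384842|
//|202003101740|202003101700|1590554927|
//+------------+------------+----------+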
Upvotes: 2