Reputation: 31
I am implemeting solutions for removing duplicate elements from two dataframes using left out join. After performing join condition, I have to check null columns for right table.
val llist = Seq(("bob", "2015-01-13", 4), ("alice", "2015-04-23",10))
val left = llist.toDF("name","date","duration")
val right = Seq(("alice", "2015-04-23",10),("bob", "2015-04-23", 23)).toDF("name","date","duration")
val df = left.join(right , left("name") === right("name") &&
left("date") === right("date") &&
left("duration").cast(StringType) === right("duration").cast(StringType)
,"left_outer").filter(right("duration").isNull)
But I am unable to filter out integer columns with null values. How can we do null check for integers after join ?
Upvotes: 2
Views: 5494
Reputation: 27373
It's rather unclear what you want to achieve. The way you do it creates ambiguous column names. In addition, you reference the original (source) dataframe (right
) in the filter condition, not the joined dataframe.
If you want to join them, you can do:
val df = left
.join(right , Seq("name","date","duration"),"left_outer")
But will not result in any "null" columns because duplicated columns are removed.
+-----+----------+--------+
| name| date|duration|
+-----+----------+--------+
| bob|2015-01-13| 4|
|alice|2015-04-23| 10|
+-----+----------+--------+
Otherwise, you can try this:
val df = left.as('left)
.join(right.as('right) ,
$"left.name" === $"right.name"
and $"left.date" === $"right.date"
and $"left.duration" === $"right.duration"
,"left_outer"
)
.filter($"right.duration".isNull)
this will result in
+----+----------+--------+----+----+--------+
|name| date|duration|name|date|duration|
+----+----------+--------+----+----+--------+
| bob|2015-01-13| 4|null|null| null|
+----+----------+--------+----+----+--------+
EDIT:
If you just want to remove duplicates, you could to this:
val df = left.unionAll(right).distinct()
Upvotes: 1