Reputation: 43
I have these two data frames called:
table1
+---------+------------+------+
| Animal | Owner |count1|
+---------+------------+------+
| Cat | Bob | 3 |
| Fish | Jerry | 2 |
| Dog | Bob | 2 |
| Turtle | Joe | 5 |
+---------+------------+------+
table2
+---------+------------+------+
| Animal | Owner |count2|
+---------+------------+------+
| Cat | Bob | 2 |
| Fish | Jerry | 1 |
| Dog | Bob | 3 |
| Snake | Kim | 6 |
+---------+------------+------+
I am trying to combine both data frames so that the result contains only the rows where count2 is greater than count1, plus the rows that appear in only one of the two tables. This is the expected output data frame:
+---------+------------+------+------+
| Animal | Owner |count1|count2|
+---------+------------+------+------+
| Dog | Bob | 2 | 3 |
| Turtle | Joe | 5 | null |
| Snake | Kim | null | 6 |
+---------+------------+------+------+
Rows that appear in 'table1' and not in 'table2' (or vice versa) should have 'null' for the missing count.
Upvotes: 1
Views: 47
Reputation: 1892
In Spark, try a full join with a filter condition.
scala> val t1 = Seq(("Cat","Bob",3), ("Fish","Jerry",2), ("Dog","Bob",2), ("Turtle","Joe",5)).toDF("Animal","Owner","count1")
scala> val t2 = Seq(("Cat","Bob",2), ("Fish","Jerry",1), ("Dog","Bob",3), ("Snake","Kim",6)).toDF("Animal","Owner","count2")
Full-join t1 (table1) and t2 (table2) on Animal and Owner, then filter for rows where count2 is greater than count1 or where either count column is null (i.e. the row exists in only one table).
scala> t2.join(t1,Seq("Animal","Owner"),"full").filter(col("count2")>col("count1") || col("count2").isNull || col("count1").isNull).show
+------+-----+------+------+
|Animal|Owner|count2|count1|
+------+-----+------+------+
| Dog| Bob| 3| 2|
| Snake| Kim| 6| null|
|Turtle| Joe| null| 5|
+------+-----+------+------+
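The same full-outer-join-plus-filter logic can be sketched in plain Scala without Spark, assuming small in-memory tables keyed by (Animal, Owner); the object and method names here are made up for illustration:

```scala
// Minimal sketch of a full outer join followed by the filter
// "count2 > count1 OR either count is missing", using plain Scala Maps.
object FullJoinSketch {
  type Key = (String, String) // (Animal, Owner)

  def combine(t1: Map[Key, Int], t2: Map[Key, Int]): Map[Key, (Option[Int], Option[Int])] =
    (t1.keySet ++ t2.keySet).flatMap { k =>
      val (c1, c2) = (t1.get(k), t2.get(k))
      // keep rows missing from one table, or where count2 > count1
      if (c1.isEmpty || c2.isEmpty || c2.get > c1.get) Some(k -> (c1, c2)) else None
    }.toMap

  def main(args: Array[String]): Unit = {
    val t1 = Map(("Cat", "Bob") -> 3, ("Fish", "Jerry") -> 2, ("Dog", "Bob") -> 2, ("Turtle", "Joe") -> 5)
    val t2 = Map(("Cat", "Bob") -> 2, ("Fish", "Jerry") -> 1, ("Dog", "Bob") -> 3, ("Snake", "Kim") -> 6)
    combine(t1, t2).foreach(println)
  }
}
```

A `None` in the result plays the role of Spark's `null` for the side of the join where the row is absent.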
Upvotes: 2