Reputation: 120
I have a dataframe, df2 such as:
ID | data
--------
1 | New
3 | New
5 | New
and a main dataframe, df1:
ID | data | more
----------------
1 | OLD | a
2 | OLD | b
3 | OLD | c
4 | OLD | d
5 | OLD | e
I want to achieve something of the sort:
ID | data | more
----------------
1 | New | a
2 | OLD | b
3 | New | c
4 | OLD | d
5 | New | e
I want to update df1 based on df2, keeping the original values of df1 when the ID doesn't exist in df2.
Is there a faster way to do this than using isin? isin is very slow when df1 and df2 are both very large.
Upvotes: 1
Views: 69
Reputation: 7207
Use a left join with coalesce, which takes the value from df2 when the ID matches and falls back to df1 otherwise:
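// Assumes a SparkSession named spark is in scope (as in spark-shell).
import org.apache.spark.sql.functions.coalesce
import spark.implicits._ // enables toDF and the $"..." column syntax

val df1 = Seq(
  (1, "OLD", "a"),
  (2, "OLD", "b"),
  (3, "OLD", "c"),
  (4, "OLD", "d"),
  (5, "OLD", "e")).toDF("ID", "data", "more")

val df2 = Seq(
  (1, "New"),
  (3, "New"),
  (5, "New")).toDF("ID", "data")

// The left join keeps every row of df1; coalesce picks df2.data when the ID
// matched and df1.data otherwise.
val result = df1.alias("df1")
  .join(df2.alias("df2"), $"df2.ID" === $"df1.ID", "left")
  .select(
    $"df1.ID",
    coalesce($"df2.data", $"df1.data").alias("data"),
    $"more")

// action: sort by ID so the printed rows come out in a stable order
result.orderBy("ID").show(false)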
Output:
+---+----+----+
|ID |data|more|
+---+----+----+
|1 |New |a |
|2 |OLD |b |
|3 |New |c |
|4 |OLD |d |
|5 |New |e |
+---+----+----+
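If you would rather avoid the aliases, the same idea can be wrapped in a small helper. This is only a sketch: the object name UpdateFromJoin, the method updateData and the temporary column new_data are names made up for illustration, and it assumes ID is unique in df2 (duplicate IDs in df2 would duplicate the matching rows of df1). Joining on Seq("ID") keeps a single ID column, so no qualification is needed afterwards.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col}

object UpdateFromJoin {
  // Overwrite df1.data with df2.data wherever the ID is present in df2.
  // Hypothetical helper; assumes both frames have "ID" and "data" columns.
  def updateData(df1: DataFrame, df2: DataFrame): DataFrame = {
    // Rename the incoming column so both values can coexist after the join.
    val updates = df2.select(col("ID"), col("data").alias("new_data"))
    df1
      .join(updates, Seq("ID"), "left") // keeps every row of df1
      .withColumn("data", coalesce(col("new_data"), col("data")))
      .drop("new_data")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("update-from-join")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq(
      (1, "OLD", "a"), (2, "OLD", "b"), (3, "OLD", "c"),
      (4, "OLD", "d"), (5, "OLD", "e")).toDF("ID", "data", "more")
    val df2 = Seq((1, "New"), (3, "New"), (5, "New")).toDF("ID", "data")

    updateData(df1, df2).orderBy("ID").show(false)
    spark.stop()
  }
}
```

Note that coalesce also falls back to the old value when df2 contains the ID but its data is null, which may or may not be what you want.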
Upvotes: 3