Reputation: 8263
I have two DataFrames (Scala Spark), A and B. When A("id") == B("a_id"), I want to update A("value") to B("value"). Since DataFrames are immutable and have to be recreated, I'm assuming I need some combination of joins and withColumn calls, but I'm not sure how to put it together. In SQL this would be a simple UPDATE on a natural join, but for some reason it seems difficult in Spark.
Upvotes: 1
Views: 2281
Reputation: 37822
Indeed, a left join and a select call would do the trick:
// assuming "spark" is an active SparkSession:
import org.apache.spark.sql.functions._
import spark.implicits._
// some sample data; note it's convenient to alias the DataFrames using .as(...)
val A = Seq((1, "a1"), (2, "a2"), (3, "a3")).toDF("id", "value").as("A")
val B = Seq((1, "b1"), (2, "b2")).toDF("a_id", "value").as("B")
// left join + coalesce to "choose" the original value if no match found:
val result = A.join(B, $"A.id" === $"B.a_id", "left")
.select($"id", coalesce($"B.value", $"A.value") as "value")
// result:
// +---+-----+
// | id|value|
// +---+-----+
// | 1| b1|
// | 2| b2|
// | 3| a3|
// +---+-----+
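Alternatively, since the question mentions SQL: the same logic can be expressed in Spark SQL by registering the DataFrames as temp views. A sketch under the same setup (the view names A_view and B_view are my choice):
// same left join + COALESCE, written as a Spark SQL query
A.createOrReplaceTempView("A_view")
B.createOrReplaceTempView("B_view")
val resultSql = spark.sql("""
  SELECT A_view.id, COALESCE(B_view.value, A_view.value) AS value
  FROM A_view LEFT JOIN B_view ON A_view.id = B_view.a_id
""")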
Notice that there's no real "update" here: result is a new DataFrame which you can use (write, count, ...), but the original DataFrames remain unchanged.
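For instance, a quick check (reusing A and result from above):
// A still contains the original values a1, a2, a3:
A.show()
// while result holds the "updated" values shown above:
result.show()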
Upvotes: 1