nobody

Reputation: 8263

Replace one dataframe column value with another's value

I have two DataFrames (Scala Spark), A and B. When A("id") == B("a_id") I want to update A("value") to B("value"). Since DataFrames are immutable and have to be recreated, I'm assuming I need some combination of join and withColumn calls, but I'm not sure how to do this. In SQL it would be a simple UPDATE on a natural join, but for some reason this seems difficult in Spark?

Upvotes: 1

Views: 2281

Answers (1)

Tzach Zohar

Reputation: 37822

Indeed, a left join and a select call would do the trick:

// assuming "spark" is an active SparkSession: 
import org.apache.spark.sql.functions._
import spark.implicits._

// some sample data; notice it's convenient to name the DataFrames using .as(...)
val A = Seq((1, "a1"), (2, "a2"), (3, "a3")).toDF("id", "value").as("A")
val B = Seq((1, "b1"), (2, "b2")).toDF("a_id", "value").as("B")

// left join + coalesce to "choose" the original value if no match found:
val result = A.join(B, $"A.id" === $"B.a_id", "left")
  .select($"id", coalesce($"B.value", $"A.value") as "value")

// result:
// +---+-----+
// | id|value|
// +---+-----+
// |  1|   b1|
// |  2|   b2|
// |  3|   a3|
// +---+-----+

Notice that there's no real "update" here: result is a new DataFrame which you can use (write / count / ...), while the original DataFrames remain unchanged.
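
Since the question mentions that this would be a simple statement in SQL, here's a sketch of the same left join + coalesce expressed through Spark SQL instead of the DataFrame API. It's self-contained (it builds its own local SparkSession); the view names "A" and "B" and the app name are arbitrary choices for the example:

```scala
import org.apache.spark.sql.SparkSession

// a local session just for this sketch; in an application you'd reuse the existing one
val spark = SparkSession.builder().master("local[*]").appName("update-sketch").getOrCreate()
import spark.implicits._

// same sample data as above
val A = Seq((1, "a1"), (2, "a2"), (3, "a3")).toDF("id", "value")
val B = Seq((1, "b1"), (2, "b2")).toDF("a_id", "value")

// register the DataFrames so they can be referenced by name in SQL
A.createOrReplaceTempView("A")
B.createOrReplaceTempView("B")

// left join keeps all rows of A; COALESCE falls back to A.value when B has no match
val sqlResult = spark.sql("""
  SELECT A.id, COALESCE(B.value, A.value) AS value
  FROM A LEFT JOIN B ON A.id = B.a_id
""")
```

This produces the same rows as the DataFrame version above; which style to use is mostly a matter of taste, since both compile to the same query plan.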

Upvotes: 1
