How to update dataframe column in Spark Scala after join?

Question

Join of two dataframes results into almost 60 columns. Most of them suppose to stay as is, but some require update based on values in other columns. Is there a way to update those columns w/o calculating new, removing the originals and renaming the calculated back?

Simplified example: the revenue in $"Sales column from the left dataframe is supposed to be weighted by the $"Weight in the join results. Is there an efficient way to make the calculation w/o generating the $"SalesWeighted as a new column, dropping the original $Sales and re-naming $SalesWeighted into $Sales?

val l = Seq((1, 50), (2, 35), (3, 66))
            .toDF("Id", "Sales")

val r = Seq((1, "Premium", 0.2), (1, "Standard", 0.8), 
            (2, "Premium", 0.4), (2, "Standard", 0.6), 
            (3, "Premium", 0.333), (3, "Standard", 0.333), (3, "Garbage", 0.334))
            .toDF("Id", "Grade", "Weight")

display(l.join(r, Seq("Id")).withColumn("SalesWeighted", $"Sales"*$"Weight")
            .orderBy($"Id", $"Grade"))

Leo C · Accepted Answer

You can simply name the new column the same as the column to be replaced:

l.join(r, Seq("Id")).withColumn("Sales", $"Sales" * $"Weight").
  orderBy($"Id", $"Grade")

Or, just use select:

l.join(r, Seq("Id")).
  select($"Id", $"Grade", $"Weight", ($"Sales" * $"Weight").as("Sales")).
  orderBy($"Id", $"Grade")

How to update dataframe column in Spark Scala after join?

Answers (2)

Related Questions