Reputation: 524
Join of two dataframes results into almost 60 columns. Most of them suppose to stay as is, but some require update based on values in other columns. Is there a way to update those columns w/o calculating new, removing the originals and renaming the calculated back?
Simplified example: the revenue in $"Sales
column from the left dataframe is supposed to be weighted by the $"Weight
in the join results. Is there an efficient way to make the calculation w/o generating the $"SalesWeighted
as a new column, dropping the original $Sales
and re-naming $SalesWeighted
into $Sales
?
val l = Seq((1, 50), (2, 35), (3, 66))
.toDF("Id", "Sales")
val r = Seq((1, "Premium", 0.2), (1, "Standard", 0.8),
(2, "Premium", 0.4), (2, "Standard", 0.6),
(3, "Premium", 0.333), (3, "Standard", 0.333), (3, "Garbage", 0.334))
.toDF("Id", "Grade", "Weight")
display(l.join(r, Seq("Id")).withColumn("SalesWeighted", $"Sales"*$"Weight")
.orderBy($"Id", $"Grade"))
Upvotes: 0
Views: 1800
Reputation: 22449
You can simply name the new column the same as the column to be replaced:
l.join(r, Seq("Id")).withColumn("Sales", $"Sales" * $"Weight").
orderBy($"Id", $"Grade")
Or, just use select
:
l.join(r, Seq("Id")).
select($"Id", $"Grade", $"Weight", ($"Sales" * $"Weight").as("Sales")).
orderBy($"Id", $"Grade")
Upvotes: 1
Reputation: 4010
Use Drop to remove the unnecessary columns
val l = Seq((1, 50), (2, 35), (3, 66))
.toDF("Id", "Sales")
val r = Seq((1, "Premium", 0.2), (1, "Standard", 0.8),
(2, "Premium", 0.4), (2, "Standard", 0.6),
(3, "Premium", 0.333), (3, "Standard", 0.333), (3, "Garbage", 0.334))
.toDF("Id", "Grade", "Weight")
display(l.join(r, Seq("Id")).withColumn("SalesWeighted", $"Sales"*$"Weight").drop($"Sales")
.orderBy($"Id", $"Grade"))
Upvotes: 1