dinushawiki

Reputation: 27

How to join two data frames and add fields in spark

I have two dataframes, for example:

DF1
+----+-------+------+------+
|id  | value1|value2|value3|
+----+-------+------+------+
|  1 |10     |20    |30    |
|  2 |20     |30    |40    |
|  3 |30     |40    |50    |
+----+-------+------+------+

DF2
+----+-------+------+------+
|id  | value1|value2|value3|
+----+-------+------+------+
|  1 |100    |200   |300   |
|  2 |200    |300   |400   |
|  3 |300    |400   |500   |
+----+-------+------+------+

I want to merge these two dataframes by id and, at the same time, add together the columns that share the same name in the two dataframes. So what I want is something like this:

+----+-------+------+------+
|id  | value1|value2|value3|
+----+-------+------+------+
|  1 |100+10 |200+20|300+30|
|  2 |200+20 |300+30|400+40|
|  3 |300+30 |400+40|500+50|
+----+-------+------+------+

The merging part is fairly easy with Spark:

df1.join(df2,Seq("id"))

Is there an efficient way to add together the columns with the same name in the two dataframes, as shown above, using Scala and Spark?
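
For reference, the two example dataframes above can be recreated with something like the following sketch (it assumes a SparkSession named spark is already available and its implicits are imported):

// assumes `spark` is an existing SparkSession
import spark.implicits._

val df1 = Seq(
  (1, 10, 20, 30),
  (2, 20, 30, 40),
  (3, 30, 40, 50)
).toDF("id", "value1", "value2", "value3")

val df2 = Seq(
  (1, 100, 200, 300),
  (2, 200, 300, 400),
  (3, 300, 400, 500)
).toDF("id", "value1", "value2", "value3")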

Upvotes: 1

Views: 272

Answers (1)

Oli

Reputation: 10406

In Scala it is quite easy to manipulate columns in a generic way. You can do it like this:

import org.apache.spark.sql.functions.col

// extract the names of the columns to sum (every column except the join key)
val cols = df1.columns.filter(_ != "id")

// join on id, then sum each pair of same-named columns
val result = df1
    .join(df2, Seq("id"))
    .select(col("id") +: cols.map(c => df1(c) + df2(c) as c): _*)

Upvotes: 1
