Reputation: 27
I have two DataFrames:
DF1
+----+-------+------+------+
|id | pnl |value2|value3|
+----+-------+------+------+
| 1 |10 |20 |30 |
| 2 |20 |30 |40 |
| 3 |30 |40 |50 |
+----+-------+------+------+
DF2
+----+-------+------+------+
|id | pnl |value2|value3|
+----+-------+------+------+
| 1 |100 |200 |300 |
| 2 |200 |300 |400 |
| 3 |300 |400 |500 |
+----+-------+------+------+
I am trying to merge these two DataFrames by id and add the value columns together, to get something like this:
+----+-------+------+------+
|id | pnl |value2|value3|
+----+-------+------+------+
| 1 |100+10 |200+20|300+30|
| 2 |200+20 |300+30|400+40|
| 3 |300+30 |400+40|500+50|
+----+-------+------+------+
This works fine when I use:
// extract the names of the columns to sum
val cols = df1.columns.filter(_!="id")
// join and sum
val result = df1
.join(df2,Seq("id"), "full_outer")
.select( col("id") +: cols.map(c=>df1(c)+df2(c) as c) : _*)
But it breaks when an id is missing from one of the DataFrames. For example:
DF1
+----+-------+------+------+
|id | pnl |value2|value3|
+----+-------+------+------+
| 1 |10 |20 |30 |
| 2 |20 |30 |40 |
| 3 |30 |40 |50 |
| 4 |40     |40    |40    |
+----+-------+------+------+
DF2
+----+-------+------+------+
|id | pnl |value2|value3|
+----+-------+------+------+
| 1 |100 |200 |300 |
| 2 |200 |300 |400 |
| 3 |300 |400 |500 |
+----+-------+------+------+
I get the following values after I merge using the operation I mentioned above:
+----+-------+------+------+
|id | pnl |value2|value3|
+----+-------+------+------+
| 1 |100+10 |200+20|300+30|
| 2 |200+20 |300+30|400+40|
| 3 |300+30 |400+40|500+50|
| 4 |null |null |null |
+----+-------+------+------+
I understand that I am getting this because that id doesn't exist in df2. So one way I thought of solving this is by using .na.fill(0.0) after the join:
// join and sum
val result = df1
.join(df2,Seq("id"), "full_outer").na.fill(0.0)
.select( col("id") +: cols.map(c=>df1(c)+df2(c) as c) : _*)
But Spark doesn't seem to like that, and I get the following error:
org.apache.spark.sql.AnalysisException: Reference 'pnl' is ambiguous, could be: pnl, pnl.;
Does anyone know a workaround for this? Thanks
Upvotes: 0
Views: 53
Reputation: 394
The na.fill call fails because the joined DataFrame contains two columns named pnl (one from each side), so a by-name reference is ambiguous. Instead, you could handle the nulls directly in the sum expression:

cols.map(c => coalesce(df1(c), lit(0)) + coalesce(df2(c), lit(0)) as c)

coalesce returns its first non-null argument, so a row that is missing from one side contributes 0 to the sum instead of turning the whole result null.
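For context, here is a minimal end-to-end sketch of that fix, assuming the same df1/df2 data as in the question (the SparkSession setup and toDF calls are scaffolding, not part of the original code):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, lit}

val spark = SparkSession.builder().appName("join-sum").master("local[*]").getOrCreate()
import spark.implicits._

val df1 = Seq((1, 10, 20, 30), (2, 20, 30, 40), (3, 30, 40, 50), (4, 40, 40, 40))
  .toDF("id", "pnl", "value2", "value3")
val df2 = Seq((1, 100, 200, 300), (2, 200, 300, 400), (3, 300, 400, 500))
  .toDF("id", "pnl", "value2", "value3")

// extract the names of the columns to sum
val cols = df1.columns.filter(_ != "id")

// coalesce(x, lit(0)) substitutes 0 for the nulls produced by the full outer
// join, so ids present on only one side still sum correctly
val result = df1
  .join(df2, Seq("id"), "full_outer")
  .select(col("id") +: cols.map(c => coalesce(df1(c), lit(0)) + coalesce(df2(c), lit(0)) as c): _*)

result.orderBy("id").show()
```

With this, id 4 comes out as (40, 40, 40) rather than a row of nulls, and the other ids are unchanged.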
Upvotes: 1