Reputation: 383
I have 2 DFs like below.
+---+---+---+
| M| c2| c3|
+---+---+---+
| 1| 2| 3|
| 2| 3| 4|
+---+---+---+
+---+---+---+
| M| c2| c3|
+---+---+---+
| 1| 20| 30|
| 2| 30| 40|
+---+---+---+
What would be the best approach to get a new dataframe like the one below? The new DF keeps the same column names c2 and c3, but each value is concat( df1("c2"), df2("c2") ) (and likewise for c3). I can do this with
df3.withColumn("c2_new", concat( df1("c2"), df2("c2") ))
and then rename the new column to c2. But the issue is that I have 150+ columns in my DF. What would be the best approach here?
+---+----+----+
|  M|  c2|  c3|
+---+----+----+
|  1|2_20|3_30|
|  2|3_30|4_40|
+---+----+----+
Upvotes: 0
Views: 54
Reputation: 27373
You can do this with a join:
import org.apache.spark.sql.functions._

val selectExpr = df1.columns.filterNot(_ == "M").map(c => concat_ws("_", df1(c), df2(c)).as(c))
df1.join(df2,"M")
.select((col("M") +: selectExpr):_*)
.show()
gives:
+---+----+----+
| M| c2| c3|
+---+----+----+
| 1|2_20|3_30|
| 2|3_30|4_40|
+---+----+----+
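The mechanics of that select can be sketched with plain Scala collections standing in for the joined rows (the maps and names here are illustrative, not Spark API):

```scala
// Hypothetical in-memory stand-in for df1 and df2, keyed by M.
val left  = Map(1 -> Map("c2" -> "2",  "c3" -> "3"),
                2 -> Map("c2" -> "3",  "c3" -> "4"))
val right = Map(1 -> Map("c2" -> "20", "c3" -> "30"),
                2 -> Map("c2" -> "30", "c3" -> "40"))

// Join on the key, then concatenate each common column with "_",
// mirroring what concat_ws("_", df1(c), df2(c)) does per row.
val joined = left.map { case (m, row) =>
  m -> row.map { case (c, v) => c -> s"${v}_${right(m)(c)}" }
}

println(joined(1)("c2"))  // 2_20
```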
Upvotes: 2
Reputation: 4913
If you have many columns, you can iterate over them and apply the same transformation to each. In your case you can union the dataframes and aggregate the columns like this:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

val commonColumns = (df1.columns.toSet & df2.columns.toSet).filter(_ != "M").toSeq
df1.union(df2)
.groupBy("M")
.agg(count(lit(1)) as "cnt",
commonColumns.map(c => concat_ws("_", collect_set(col(c).cast(StringType))) as c):_*)
.select("M", commonColumns:_*)
.show
Here is the output:
+---+----+----+
| M| c2| c3|
+---+----+----+
| 1|20_2|3_30|
| 2|3_30|40_4|
+---+----+----+
If you have a requirement on ordering (i.e. the value from df1 must be on the left side and the value from df2 on the right), you can use this trick:
- tag each dataframe with a source column src (values 1 and 2) before the union
- during the aggregation, wrap src and each value in a struct as a new column
- take the min and max of this struct and extract the value field
Code:
df1
.withColumn("src", lit(1))
.union(df2.withColumn("src", lit(2)))
.groupBy("M")
.agg(count(lit(1)) as "cnt",
commonColumns.map(c => concat(
min(struct(col("src"), col(c)))(c),
lit("_"),
max(struct(col("src"), col(c)))(c)) as c):_*)
.select("M", commonColumns:_*)
.show
The final result is ordered:
+---+----+----+
| M| c2| c3|
+---+----+----+
| 1|2_20|3_30|
| 2|3_30|4_40|
+---+----+----+
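The trick works because Spark compares structs field by field, much like Scala compares tuples: min picks the entry tagged src = 1 (the df1 value) and max picks src = 2. A plain-Scala analogue with illustrative tuples (not Spark API):

```scala
// (src, value) pairs collected for key M = 1, column c2,
// arriving in arbitrary order after the union.
val pairs = Seq((2, "20"), (1, "2"))

// Tuples compare on the first field first, so src decides:
val fromDf1 = pairs.min._2  // value tagged src = 1
val fromDf2 = pairs.max._2  // value tagged src = 2

println(s"${fromDf1}_${fromDf2}")  // 2_20
```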
Upvotes: 2