Reputation: 383
I have 2 DFs like below.
+---+---+---+
| M| c2| c3|
+---+---+---+
| 1| 2| 3|
| 2| 3| 4|
+---+---+---+
+---+---+---+
| M| c2| c3|
+---+---+---+
| 1| 20| 30|
| 2| 30| 40|
+---+---+---+
What would be the best approach to get a new dataframe like the one below? The new DF keeps the same column names c2 and c3, but each value is concat( df1("c2"), df2("c2") ) (and likewise for c3). I can do this with
df3.withColumn("c2_new", concat( df1("c2"), df2("c2") ))
and then rename the new column to c2. But the issue is that I have 150+ columns in my DF. What would be the best approach here?
+---+----+----+
|  M|  c2|  c3|
+---+----+----+
|  1|2_20|3_30|
|  2|3_30|4_40|
+---+----+----+
Upvotes: 0
Views: 54
Reputation: 27373
You can do this with a join:
import org.apache.spark.sql.functions._

val selectExpr = df1.columns.filterNot(_ == "M").map(c => concat_ws("_", df1(c), df2(c)).as(c))
df1.join(df2,"M")
.select((col("M") +: selectExpr):_*)
.show()
gives:
+---+----+----+
| M| c2| c3|
+---+----+----+
| 1|2_20|3_30|
| 2|3_30|4_40|
+---+----+----+
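The mechanics of that select can be sketched with plain Scala collections standing in for the joined rows (the maps and names here are illustrative, not Spark API):

```scala
// Hypothetical in-memory stand-in for df1 and df2, keyed by M.
val left  = Map(1 -> Map("c2" -> "2",  "c3" -> "3"),
                2 -> Map("c2" -> "3",  "c3" -> "4"))
val right = Map(1 -> Map("c2" -> "20", "c3" -> "30"),
                2 -> Map("c2" -> "30", "c3" -> "40"))

// Join on the key, then concatenate each common column with "_",
// mirroring what concat_ws("_", df1(c), df2(c)) does per row.
val joined = left.map { case (m, row) =>
  m -> row.map { case (c, v) => c -> s"${v}_${right(m)(c)}" }
}

println(joined(1)("c2"))  // 2_20
```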
Upvotes: 2
Reputation: 4913
If you have many columns, you can iterate over them and apply the same transformation to each. In your case you can union the dataframes and aggregate the columns like this:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

val commonColumns = (df1.columns.toSet & df2.columns.toSet).filter(_ != "M").toSeq
df1.union(df2)
.groupBy("M")
.agg(count(lit(1)) as "cnt",
commonColumns.map(c => concat_ws("_", collect_set(col(c).cast(StringType))) as c):_*)
.select("M", commonColumns:_*)
.show
Here is the output:
+---+----+----+
| M| c2| c3|
+---+----+----+
| 1|20_2|3_30|
| 2|3_30|40_4|
+---+----+----+
If you have a requirement on ordering (i.e. the value from df1 must be on the left side and the value from df2 on the right), you can use this trick:
- tag each dataframe with a source column src (values 1 and 2) before the union
- during the aggregation, wrap src and each value in a struct as a new column
- take the min and max of this struct and extract the value field
Code:
df1
.withColumn("src", lit(1))
.union(df2.withColumn("src", lit(2)))
.groupBy("M")
.agg(count(lit(1)) as "cnt",
commonColumns.map(c => concat(
min(struct(col("src"), col(c)))(c),
lit("_"),
max(struct(col("src"), col(c)))(c)) as c):_*)
.select("M", commonColumns:_*)
.show
The final result is ordered:
+---+----+----+
| M| c2| c3|
+---+----+----+
| 1|2_20|3_30|
| 2|3_30|4_40|
+---+----+----+
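The trick works because Spark compares structs field by field, much like Scala compares tuples: min picks the entry tagged src = 1 (the df1 value) and max picks src = 2. A plain-Scala analogue with illustrative tuples (not Spark API):

```scala
// (src, value) pairs collected for key M = 1, column c2,
// arriving in arbitrary order after the union.
val pairs = Seq((2, "20"), (1, "2"))

// Tuples compare on the first field first, so src decides:
val fromDf1 = pairs.min._2  // value tagged src = 1
val fromDf2 = pairs.max._2  // value tagged src = 2

println(s"${fromDf1}_${fromDf2}")  // 2_20
```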
Upvotes: 2