Will Yu

Reputation: 552

Combine DataFrames with an array column

I have two DataFrames. df1 is defined as

 +----+---------+---------+
 |id  |value1   |value2   | 
 +----+---------+---------+
 |1   |["J","W"]|      0.3|
 |2   |         |      0.6|
 |3   |["n"]    |      0.7|
 +----+---------+---------+

df2 is defined as

 +----+---------+
 |id  |value1   |
 +----+---------+
 | 1  | "t"     |
 | 2  | "m"     |
 +----+---------+

Is there an easy way to combine the two DataFrames into df3, appending df2's value1 to df1's value1 array for each matching id?

 +----+--------------+---------+
 |id  |value1        |value2   | 
 +----+--------------+---------+
 |1   |["J","W", "t"]|      0.3|
 |2   |["m"]         |      0.6|
 |3   |["n"]         |      0.7|
 +----+--------------+---------+

Upvotes: 0

Views: 242

Answers (1)

Ramesh Maharjan

Reputation: 41957

First, left-join the two DataFrames on id, renaming df2's value1 column to value21 so that it does not clash with df1's value1:

val joineddf = df1.join(df2.withColumnRenamed("value1", "value21"), Seq("id"), "left")

Then define a udf that appends the value21 string to the value1 array:

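import org.apache.spark.sql.functions._

// Append str to array; a null str (no matching row in df2) leaves the array unchanged.
// Seq[String] avoids the missing scala.collection.mutable import; Spark passes array columns as a Seq.
val mergeUdf = udf((array: Seq[String], str: String) => str match {
  case null => array
  case _    => Option(array).getOrElse(Seq.empty[String]) :+ str  // tolerate a null array
})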
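
The merge rule that the udf applies can also be tried without a Spark session. Below is a minimal plain-Scala sketch of that rule; `mergeValues` is a hypothetical helper name used only for illustration, and treating a null array as empty is an assumption about how the blank value1 cell for id 2 should behave.

```scala
// Plain-Scala sketch of the merge rule, with no Spark dependency.
// mergeValues is a hypothetical name, not part of any Spark API.
def mergeValues(array: Seq[String], str: String): Seq[String] = {
  // Assumption: a null array is treated as an empty one.
  val base = Option(array).getOrElse(Seq.empty[String])
  // A null str (no matching row in df2) leaves the array unchanged.
  Option(str).fold(base)(s => base :+ s)
}

println(mergeValues(Seq("J", "W"), "t")) // id 1: array plus the matched value
println(mergeValues(null, "m"))          // id 2: missing array, matched value
println(mergeValues(Seq("n"), null))     // id 3: no match in df2
```

Wrapping the same function with udf(mergeValues _) should give an equivalent Spark udf, since Spark hands array columns to a Scala udf as a Seq.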

Finally, call the udf to overwrite value1 and drop the value21 column, which is no longer needed:

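// The $"..." column syntax needs import spark.implicits._ (spark being your SparkSession)
val df3 = joineddf.withColumn("value1", mergeUdf($"value1", $"value21"))
    .drop("value21")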

This should give you the desired output:

+---+---------+------+
|id |value1   |value2|
+---+---------+------+
|1  |[J, W, t]|0.3   |
|2  |[m]      |0.6   |
|3  |[n]      |0.7   |
+---+---------+------+

Upvotes: 2
