Reputation: 552
There are two DataFrames. df1 is defined as
+----+---------+---------+
|id |value1 |value2 |
+----+---------+---------+
|1 |["J","W"]| 0.3|
|2 | | 0.6|
|3 |["n"] | 0.7|
+----+---------+---------+
df2 is defined as
+----+---------+
|id |value1 |
+----+---------+
| 1 | "t" |
| 2 | "m" |
+----+---------+
Is there an easy way to combine the DataFrames into df3 as below?
+----+--------------+---------+
|id |value1 |value2 |
+----+--------------+---------+
|1 |["J","W", "t"]| 0.3|
|2   |["m"]         | 0.6|
|3 |["n"] | 0.7|
+----+--------------+---------+
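For reference, a minimal sketch of how the two DataFrames could be reproduced, assuming a Spark shell with Scala (the blank value1 cell for id 2 is taken to be an empty array, to match the expected output):

import spark.implicits._

// hypothetical reproduction of the two input DataFrames
val df1 = Seq(
  (1, Seq("J", "W"), 0.3),
  (2, Seq.empty[String], 0.6), // assuming the blank cell is an empty array
  (3, Seq("n"), 0.7)
).toDF("id", "value1", "value2")

val df2 = Seq(
  (1, "t"),
  (2, "m")
).toDF("id", "value1")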
Upvotes: 0
Views: 242
Reputation: 41957
You should first join the two dataframes on id, with the value1 column of df2 renamed (here to value21) so it does not clash with the column of the same name in df1:
val joineddf = df1.join(df2.withColumnRenamed("value1", "value21"), Seq("id"), "left")
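With the sample data above, the left join would give roughly the following (value21 is null for id 3, which has no match in df2):

+---+------+------+-------+
|id |value1|value2|value21|
+---+------+------+-------+
|1  |[J, W]|0.3   |t      |
|2  |[]    |0.6   |m      |
|3  |[n]   |0.7   |null   |
+---+------+------+-------+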
Then define a udf function that appends the renamed value21 value to the value1 array:
import scala.collection.mutable
import org.apache.spark.sql.functions._

// appends the string from df2 to the array from df1; a null string
// (no matching row in df2) leaves the array unchanged
def mergeUdf = udf((array: mutable.WrappedArray[String], str: String) => str match {
  case null => array
  case _ => array ++ Array(str)
})
Finally, call the udf function and drop the column that is no longer needed (the $ column syntax requires import spark.implicits._):
joineddf.withColumn("value1", mergeUdf($"value1", $"value21"))
.drop("value21")
You should get your desired output:
+---+---------+------+
|id |value1 |value2|
+---+---------+------+
|1 |[J, W, t]|0.3 |
|2 |[m] |0.6 |
|3 |[n] |0.7 |
+---+---------+------+
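On Spark 2.4 or later, a udf may not be needed at all; a sketch of the same merge using only built-in functions (assuming the same joineddf as above):

import org.apache.spark.sql.functions._

// concat on array columns plus a null check replaces the udf (Spark 2.4+)
val df3 = joineddf
  .withColumn("value1",
    when($"value21".isNull, $"value1")
      .otherwise(concat($"value1", array($"value21"))))
  .drop("value21")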
Upvotes: 2