Reputation: 315
I have two DataFrames:
df1:
+---------------+-------------------+-----+------------------------+------------------------+---------+
|id             |dt                 |speed|stats                   |lag_stat                |lag_speed|
+---------------+-------------------+-----+------------------------+------------------------+---------+
|358899055773504|2018-07-31 18:38:36|0    |[9, -1, -1, 13, 0, 1, 0]|null                    |null     |
|358899055773504|2018-07-31 18:58:34|0    |[9, 0, -1, 22, 0, 1, 0] |[9, -1, -1, 13, 0, 1, 0]|0        |
|358899055773505|2018-07-31 18:54:23|4    |[9, 0, 0, 22, 1, 1, 1]  |null                    |null     |
+---------------+-------------------+-----+------------------------+------------------------+---------+
df2:
+---------------+-------------------+-----+------------------------+
|id             |dt                 |speed|stats                   |
+---------------+-------------------+-----+------------------------+
|358899055773504|2018-07-31 18:38:34|0    |[9, -1, -1, 13, 0, 1, 0]|
|358899055773505|2018-07-31 18:48:23|4    |[8, -1, 0, 22, 1, 1, 1] |
+---------------+-------------------+-----+------------------------+
I want to replace the null values in the lag_stat and lag_speed columns of df1 with the stats and speed values from df2 for the same id.
Desired output looks like this:
+---------------+-------------------+-----+------------------------+------------------------+---------+
|id             |dt                 |speed|stats                   |lag_stat                |lag_speed|
+---------------+-------------------+-----+------------------------+------------------------+---------+
|358899055773504|2018-07-31 18:38:36|0    |[9, -1, -1, 13, 0, 1, 0]|[9, -1, -1, 13, 0, 1, 0]|0        |
|358899055773504|2018-07-31 18:58:34|0    |[9, 0, -1, 22, 0, 1, 0] |[9, -1, -1, 13, 0, 1, 0]|0        |
|358899055773505|2018-07-31 18:54:23|4    |[9, 0, 0, 22, 1, 1, 1]  |[8, -1, 0, 22, 1, 1, 1] |4        |
+---------------+-------------------+-----+------------------------+------------------------+---------+
Upvotes: 2
Views: 1521
Reputation: 1525
One possible way is to join the DataFrames and then apply when functions to those columns.
For example, this:
import org.apache.spark.sql.functions.when

// Join on id, then fall back to df2's values wherever df1's lag columns are null.
val output = df1.join(df2, df1.col("id") === df2.col("id"))
  .select(
    df1.col("id"),
    df1.col("dt"),
    df1.col("speed"),
    df1.col("stats"),
    when(df1.col("lag_stat").isNull, df2.col("stats")).otherwise(df1.col("lag_stat")).alias("lag_stats"),
    when(df1.col("lag_speed").isNull, df2.col("speed")).otherwise(df1.col("lag_speed")).alias("lag_speed")
  )
will give you the expected output:
+---------------+------------------+-----+------------------+------------------+---------+
| id| dt|speed| stats| lag_stats|lag_speed|
+---------------+------------------+-----+------------------+------------------+---------+
|358899055773504|2018-07-3118:38:36| 0|[9,-1,-1,13,0,1,0]|[9,-1,-1,13,0,1,0]| 0|
|358899055773504|2018-07-3118:58:34| 0| [9,0,-1,22,0,1,0]|[9,-1,-1,13,0,1,0]| 0|
|358899055773505|2018-07-3118:54:23| 4| [9,0,0,22,1,1,1]| [8,-1,0,22,1,1,1]| 4|
+---------------+------------------+-----+------------------+------------------+---------+
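A variation that may be worth considering: the join above is an inner join, so a df1 row whose id has no match in df2 would be dropped. The sketch below uses a left join plus coalesce instead, and keeps the original column name lag_stat. It is only a sketch under some assumptions: the helper names (fillLag, sample) and the temporary columns fb_stat and fb_speed are made up for illustration, the local[1] session is just for trying it out, and df2 is assumed to have at most one row per id (otherwise df1 rows would be duplicated by the join).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col}

object FillLag {
  // Hypothetical helper: fill null lag columns in df1 from df2, matched on id.
  def fillLag(df1: DataFrame, df2: DataFrame): DataFrame = {
    // Rename df2's value columns up front so nothing is ambiguous after the join.
    // fb_stat / fb_speed are illustrative names, not part of the original data.
    val fallback = df2.select(
      col("id"),
      col("stats").as("fb_stat"),
      col("speed").as("fb_speed")
    )

    df1
      .join(fallback, Seq("id"), "left") // left join keeps df1 rows without a match
      .withColumn("lag_stat", coalesce(col("lag_stat"), col("fb_stat")))
      .withColumn("lag_speed", coalesce(col("lag_speed"), col("fb_speed")))
      .drop("fb_stat", "fb_speed")
  }

  // Hypothetical helper: rebuild the sample rows from the question.
  // Option models the nullable lag columns; dt is kept as a plain string here.
  def sample(spark: SparkSession): (DataFrame, DataFrame) = {
    import spark.implicits._

    val df1 = Seq(
      (358899055773504L, "2018-07-31 18:38:36", 0, Seq(9, -1, -1, 13, 0, 1, 0),
        Option.empty[Seq[Int]], Option.empty[Int]),
      (358899055773504L, "2018-07-31 18:58:34", 0, Seq(9, 0, -1, 22, 0, 1, 0),
        Option(Seq(9, -1, -1, 13, 0, 1, 0)), Option(0)),
      (358899055773505L, "2018-07-31 18:54:23", 4, Seq(9, 0, 0, 22, 1, 1, 1),
        Option.empty[Seq[Int]], Option.empty[Int])
    ).toDF("id", "dt", "speed", "stats", "lag_stat", "lag_speed")

    val df2 = Seq(
      (358899055773504L, "2018-07-31 18:38:34", 0, Seq(9, -1, -1, 13, 0, 1, 0)),
      (358899055773505L, "2018-07-31 18:48:23", 4, Seq(8, -1, 0, 22, 1, 1, 1))
    ).toDF("id", "dt", "speed", "stats")

    (df1, df2)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; the master setting is an assumption.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("fill-lag-sketch")
      .getOrCreate()

    val (df1, df2) = sample(spark)
    fillLag(df1, df2).show(false)

    spark.stop()
  }
}
```

coalesce returns the first non-null argument, so it does the same job as the when(...).otherwise(...) pair with a bit less repetition. If an id is missing from df2, the lag columns for that row simply stay null rather than the row disappearing.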
Upvotes: 5