Reputation: 420
I have one data frame as ( This is overall data frame) with 0s and 1s
+---+-----+
|key|value|
+---+-----+
| a| 0.5|
| b| 0.4|
| c| 0.5|
| d| 0.3|
| x| 0.0|
| y| 0.0|
| z| 0.0|
+---+-----+
and the second dataframe is ( Bad Output ) ( Should contain only 0s )
+---+-----+
|key|value|
+---+-----+
| a| 0.0|
| e| 0.0|
| f| 0.0|
| g| 0.0|
+---+-----+
Note : the value of `a` has chnaged
How to write my script so I can get the following output of my second Dataframe ( only 0s and as value of a
is 1 in good data frame , i want to remove it from bad one
+---+-----+
|key|value|
+---+-----+
| e| 0.0|
| f| 0.0|
| g| 0.0|
| x| 0.0|
| y| 0.0|
| z| 0.0|
+---+-----+
Upvotes: 0
Views: 263
Reputation: 7207
Non-zero overall values can be removed from bad output, and zero overall values added (Scala):
val overall = Seq(
("a", 0.5),
("b", 0.4),
("c", 0.5),
("d", 0.3),
("x", 0.0),
("y", 0.0),
("z", 0.0),
).toDF("key", "value")
val badOutput = Seq(
("a", 0.0),
("e", 0.0),
("f", 0.0),
("g", 0.0)
)
.toDF("key", "value")
badOutput
.except(overall.where($"value"=!=0).withColumn("value", lit(0.0)))
.union (overall.where($"value"===0))
Upvotes: 1
Reputation: 31490
You can union two dataframes then groupBy
+ array_contains
function to get the desired result.
Example:
df.show()
#+---+-----+
#|key|value|
#+---+-----+
#| a| 1|
#| b| 1|
#| c| 1|
#| d| 1|
#| x| 0|
#| y| 0|
#| z| 0|
#+---+-----+
df1.show()
#+---+-----+
#|key|value|
#+---+-----+
#| a| 0|
#| e| 0|
#| f| 0|
#| g| 0|
#+---+-----+
df2=df.unionAll(df1)
df3=df2.groupBy("key").agg(collect_list(col("value")).alias("lst"))
df3.filter(~array_contains("lst",1)).\
withColumn("lst",array_join(col("lst"),'')).\
show()
#+---+---+
#|key|lst|
#+---+---+
#| x| 0|
#| g| 0|
#| f| 0|
#| e| 0|
#| z| 0|
#| y| 0|
#+---+---+
Upvotes: 0