kukroid
kukroid

Reputation: 420

How to intersect/union pyspark dataframes with different values

I have one data frame as ( This is overall data frame) with 0s and 1s

+---+-----+                                                                     
|key|value|
+---+-----+
|  a|  0.5|
|  b|  0.4|
|  c|  0.5|
|  d|  0.3|
|  x|  0.0|
|  y|  0.0|
|  z|  0.0|
+---+-----+

and the second dataframe is ( Bad Output ) ( Should contain only 0s )

+---+-----+
|key|value|
+---+-----+
|  a|  0.0|
|  e|  0.0|
|  f|  0.0|
|  g|  0.0|
+---+-----+

Note : the value of `a` has chnaged

How to write my script so I can get the following output of my second Dataframe ( only 0s and as value of a is 1 in good data frame , i want to remove it from bad one

+---+-----+
|key|value|
+---+-----+
|  e|  0.0|
|  f|  0.0|
|  g|  0.0|
|  x|  0.0|
|  y|  0.0|
|  z|  0.0|
+---+-----+

Upvotes: 0

Views: 263

Answers (2)

pasha701
pasha701

Reputation: 7207

Non-zero overall values can be removed from bad output, and zero overall values added (Scala):

val overall = Seq(
  ("a", 0.5),
  ("b", 0.4),
  ("c", 0.5),
  ("d", 0.3),
  ("x", 0.0),
  ("y", 0.0),
  ("z", 0.0),
).toDF("key", "value")

val badOutput = Seq(
  ("a", 0.0),
  ("e", 0.0),
  ("f", 0.0),
  ("g", 0.0)
)
  .toDF("key", "value")

badOutput
  .except(overall.where($"value"=!=0).withColumn("value", lit(0.0)))
  .union (overall.where($"value"===0))

Upvotes: 1

notNull
notNull

Reputation: 31490

You can union two dataframes then groupBy + array_contains function to get the desired result.

Example:

df.show()
#+---+-----+
#|key|value|
#+---+-----+
#|  a|    1|
#|  b|    1|
#|  c|    1|
#|  d|    1|
#|  x|    0|
#|  y|    0|
#|  z|    0|
#+---+-----+

df1.show()
#+---+-----+
#|key|value|
#+---+-----+
#|  a|    0|
#|  e|    0|
#|  f|    0|
#|  g|    0|
#+---+-----+

df2=df.unionAll(df1)

df3=df2.groupBy("key").agg(collect_list(col("value")).alias("lst"))

df3.filter(~array_contains("lst",1)).\
withColumn("lst",array_join(col("lst"),'')).\
show()
#+---+---+
#|key|lst|
#+---+---+
#|  x|  0|
#|  g|  0|
#|  f|  0|
#|  e|  0|
#|  z|  0|
#|  y|  0|
#+---+---+

Upvotes: 0

Related Questions