user299791

Reputation: 2070

How to combine (join) information across an Array[DataFrame]

I have an Array[DataFrame] and I want to check, for each row ID, whether any of the values in a given column change across the data frames. Say the first rows of my three data frames are:

 (0,1.0,0.4,0.1)
 (0,3.0,0.2,0.1)
 (0,5.0,0.4,0.1)

The first column is the ID, and my ideal output for this ID would be:

 (0, 1, 1, 0)

meaning that the second and third columns changed while the fourth did not. I attach a bit of data here to replicate my setting:

val rdd = sc.parallelize(Array((0,1.0,0.4,0.1),
                               (1,0.9,0.3,0.3),
                               (2,0.2,0.9,0.2),
                               (3,0.9,0.2,0.2),
                               (4,0.3,0.5,0.5)))
val rdd2 = sc.parallelize(Array((0,3.0,0.2,0.1),
                                (1,0.9,0.3,0.3),
                                (2,0.2,0.5,0.2),
                                (3,0.8,0.1,0.1),
                                (4,0.3,0.5,0.5)))
val rdd3 = sc.parallelize(Array((0,5.0,0.4,0.1),
                                (1,0.5,0.3,0.3),
                                (2,0.3,0.3,0.5),
                                (3,0.3,0.3,0.1),
                                (4,0.3,0.5,0.5)))
val df = rdd.toDF("id", "prop1", "prop2", "prop3")
val df2 = rdd2.toDF("id", "prop1", "prop2", "prop3")
val df3 = rdd3.toDF("id", "prop1", "prop2", "prop3")
val result:Array[DataFrame] = new Array[DataFrame](3)
result.update(0, df)
result.update(1,df2)
result.update(2,df3)
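
As an aside (not essential to the question), the same array can be built in one step:

val result: Array[DataFrame] = Array(df, df2, df3)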

How can I map over the array and get my output?

Upvotes: 3

Views: 478

Answers (2)

marios

Reputation: 8996

First we need to join all the DataFrames together.

val combined = result.reduceLeft((a,b) => a.join(b,"id"))
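
Note that the combined frame now contains three columns for each of prop1, prop2 and prop3, all sharing the same name; positional access still works, which is why the next step drops down to the RDD level. If you would rather keep distinct column names, one option (a sketch, not part of the original answer; the suffixed names are illustrative) is to rename before joining:

// Sketch: disambiguate the columns up front by suffixing each frame's
// value columns with its position in the array.
val renamed = result.zipWithIndex.map { case (df, i) =>
  df.toDF("id", s"prop1_$i", s"prop2_$i", s"prop3_$i")
}
val combinedNamed = renamed.reduceLeft((a, b) => a.join(b, "id"))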

To compare all the columns with the same label (e.g., "prop1"), I found it easier (at least for me) to operate at the RDD level. We first transform the data into (id, Seq[Double]).

val finalResults = combined.rdd.map { x =>
  // Keep the id and collect the remaining nine values
  // (three frames x three property columns, in join order).
  (x.getInt(0), x.toSeq.tail.map(_.asInstanceOf[Double]))
}.map { case (i, d) =>
  def checkAllEqual(l: Seq[Double]) = if (l.toSet.size == 1) 0 else 1
  // Each group of three holds one frame's (prop1, prop2, prop3).
  val g = d.grouped(3).toList
  val g1 = checkAllEqual(g.map(x => x(0))) // prop1 across the frames
  val g2 = checkAllEqual(g.map(x => x(1))) // prop2 across the frames
  val g3 = checkAllEqual(g.map(x => x(2))) // prop3 across the frames
  (i, g1, g2, g3)
}.toDF("id", "prop1", "prop2", "prop3")

finalResults.show()

This will print:

+---+-----+-----+-----+
| id|prop1|prop2|prop3|
+---+-----+-----+-----+
|  0|    1|    1|    0|
|  1|    1|    0|    0|
|  2|    1|    1|    1|
|  3|    1|    1|    1|
|  4|    0|    0|    0|
+---+-----+-----+-----+

Upvotes: 1

zero323

Reputation: 330433

You can use countDistinct with groupBy: once the frames are stacked, a column changed for a given id exactly when it has more than one distinct value.

import org.apache.spark.sql.functions.countDistinct

val exprs = Seq("prop1", "prop2", "prop3")
  .map(c => (countDistinct(c) > 1).cast("integer").alias(c))

val combined = result.reduce(_ unionAll _)

val aggregatedViaGroupBy = combined
  .groupBy($"id")
  .agg(exprs.head, exprs.tail: _*)

aggregatedViaGroupBy.show
// +---+-----+-----+-----+
// | id|prop1|prop2|prop3|
// +---+-----+-----+-----+
// |  0|    1|    1|    0|
// |  1|    1|    0|    0|
// |  2|    1|    1|    1|
// |  3|    1|    1|    1|
// |  4|    0|    0|    0|
// +---+-----+-----+-----+
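
A note for newer Spark versions (my addition, not part of the original answer): unionAll was deprecated in Spark 2.x in favor of union, so the combine step would become the following, with the groupBy/agg step unchanged:

// Spark 2.x+: union replaces the deprecated unionAll.
val combined = result.reduce(_ union _)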

Upvotes: 2
