J Bentz

Reputation: 45

Grouping by an aggregated (summed) double produces inconsistent results in Spark

I'm seeing some inconsistent behavior with Scala and Spark 2.0 when aggregating doubles and then grouping by the aggregated value. It only occurs in cluster mode, and I believe it has something to do with the order in which the doubles are summed producing a slightly different number. After the initial aggregation, I pivot the results and group by the summed value. Sometimes I see 1 row, sometimes 2 rows, depending on the value changing slightly at the 20th decimal place or so. I can't show a full example, but here is a simplified/contrived version in the REPL which behaves correctly but shows what I'm trying to do:

scala> val df = List((1, "a", 27577661.013638947), (1, "a", 37577661.013538947)).toDF("a", "b", "c")
df: org.apache.spark.sql.DataFrame = [a: int, b: string ... 1 more field]

scala> df.show
+---+---+--------------------+
|  a|  b|                   c|
+---+---+--------------------+
|  1|  a|2.7577661013638947E7|
|  1|  a| 3.757766101353895E7|
+---+---+--------------------+

scala> val grouped = df.groupBy("a", "b").agg(sum("c").as("c"))
grouped: org.apache.spark.sql.DataFrame = [a: int, b: string ... 1 more field]

scala> grouped.show
+---+---+------------------+
|  a|  b|                 c|
+---+---+------------------+
|  1|  a|6.51553220271779E7|
+---+---+------------------+


scala> val pivoted = grouped.groupBy("c").pivot("a").agg(first("b"))
pivoted: org.apache.spark.sql.DataFrame = [c: double, 1: string]

scala> pivoted.show
+------------------+---+
|                 c|  1|
+------------------+---+
|6.51553220271779E7|  a|
+------------------+---+

The problem shows up after the pivot, where I will sometimes see 2 rows instead of the expected single row.

Is this expected? A bug? Any workarounds? I've tried using BigDecimal vs double, rounding, UDF vs column expression, and nothing helps so far. Thank you!
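For reference, the rounding variant I tried looked roughly like this (the scale of 6 is just an example; I tried a few different values):

scala> import org.apache.spark.sql.functions.{sum, first, round}

scala> val grouped = df.groupBy("a", "b").agg(round(sum("c"), 6).as("c"))

scala> val pivoted = grouped.groupBy("c").pivot("a").agg(first("b"))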

Upvotes: 0

Views: 2492

Answers (1)

zero323

Reputation: 330073

It is expected:

  • Floating point arithmetic is not associative. The order of aggregation in Spark is non-deterministic, and so is the result (see the snippet below).
  • Floating point numbers are not a good choice for grouping keys. They have no meaningful exact equality (typically you check whether the difference is smaller than some tolerance on the order of machine epsilon). In Spark, where aggregations are based on hashes, you cannot even use that notion of equality.
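For illustration, the non-associativity is easy to reproduce in a plain Scala REPL with the classic example (nothing Spark-specific about it):

scala> (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)
res0: Boolean = false

If you need a stable grouping key, one possible mitigation is to move to an exact representation before grouping, for example by casting the aggregated sum to a fixed-scale decimal (this reuses the df and imports from the question; the precision and scale here are purely illustrative):

scala> val grouped = df.groupBy("a", "b").agg(sum("c").cast("decimal(38, 6)").as("c"))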

Upvotes: 1
