Reputation: 45
I'm seeing some inconsistent behavior with Scala and Spark 2.0 when aggregating doubles and then grouping by the aggregated value. This only occurs in cluster mode, and I believe it has something to do with the order in which the doubles are summed producing a slightly different number. After the initial aggregation, I pivot the results and group by the summed value. Sometimes I see 1 row, sometimes 2 rows, with the value differing slightly around the 20th decimal place. I can't show a full example, but here is a simplified/contrived version in the REPL which behaves correctly but shows what I'm trying to do:
scala> val df = List((1, "a", 27577661.013638947), (1, "a", 37577661.013538947)).toDF("a", "b", "c")
df: org.apache.spark.sql.DataFrame = [a: int, b: string ... 1 more field]
scala> df.show
+---+---+--------------------+
| a| b| c|
+---+---+--------------------+
| 1| a|2.7577661013638947E7|
| 1| a| 3.757766101353895E7|
+---+---+--------------------+
scala> val grouped = df.groupBy("a", "b").agg(sum("c").as("c"))
grouped: org.apache.spark.sql.DataFrame = [a: int, b: string ... 1 more field]
scala> grouped.show
+---+---+------------------+
| a| b| c|
+---+---+------------------+
| 1| a|6.51553220271779E7|
+---+---+------------------+
scala> val pivoted = grouped.groupBy("c").pivot("a").agg(first("b"))
pivoted: org.apache.spark.sql.DataFrame = [c: double, 1: string]
scala> pivoted.show
+------------------+---+
| c| 1|
+------------------+---+
|6.51553220271779E7| a|
+------------------+---+
The problem shows up after the pivot, where I sometimes see 2 rows instead of the expected single row.
Is this expected? A bug? Are there any workarounds? I've tried using BigDecimal instead of Double, rounding, and a UDF instead of a column expression, and nothing has helped so far. Thank you!
Upvotes: 0
Views: 2492
Reputation: 330073
It is expected: floating-point addition is not associative, and in cluster mode the order in which Spark combines partial results across partitions is not deterministic, so the same aggregation can produce values that differ in their last few bits from run to run. A groupBy on the aggregated double then treats those nearly-equal values as distinct keys.
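You can reproduce the order sensitivity directly in the REPL:

scala> (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)
res0: Boolean = false

As a workaround, a minimal sketch (assuming a fixed precision, here 3 decimal places, is acceptable for your data) is to quantize the aggregated value before the second groupBy, so that keys differing only in the least-significant bits collapse into one:

scala> // round is org.apache.spark.sql.functions.round, available in spark-shell
scala> val pivoted = grouped.groupBy(round($"c", 3).as("c")).pivot("a").agg(first("b"))

Casting c to a fixed-scale DecimalType before the second groupBy would achieve a similar effect.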
Upvotes: 1