Reputation: 1388
We have an Impala table with ~1 billion rows and a value column of type DOUBLE. When we run the same 'select {dimensions}, sum(value) from table group by {dimensions}' query several times in a row on the same table, we get slightly different sums each time. This also happens when we sum rounded values. What could cause this variability, and is there a way around it?
Upvotes: 1
Views: 696
Reputation: 4334
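Because an Impala query executes in a distributed fashion, the order in which partial results are computed and combined can vary from run to run as a result of network variability or other processes. Floating-point addition is not associative [1], so a different summation order can produce a slightly different total, which is the behavior you're seeing. Rounding the values first doesn't help: the rounded values are still stored as DOUBLE, and most decimal fractions have no exact binary representation. If you need the same sum on every run, this is the kind of case the DECIMAL datatype is meant for, since DECIMAL addition is exact as long as the result fits the declared precision.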
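To make the effect concrete, here is a small Python sketch (not Impala code) that imitates what happens when the same values get added in a different order on each run. The names `sum_in_order`, `cents`, `floats` and `decimals` are made up for illustration, the shuffling is only a stand-in for the varying merge order in a distributed plan, and Python's `decimal.Decimal` stands in for a DECIMAL column.

```python
import random
from decimal import Decimal

# Floating-point addition is not associative: grouping changes the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

def sum_in_order(values, seed):
    """Add the values one by one in a seed-dependent order.

    A plain loop is used on purpose: the built-in sum() may apply
    compensated summation to floats in newer Python versions, which
    would hide the effect being illustrated.
    """
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    total = shuffled[0]
    for v in shuffled[1:]:
        total = total + v
    return total

# Hypothetical data: 100,000 amounts with two decimal places.
rng = random.Random(0)
cents = [rng.randrange(10**8) for _ in range(100_000)]
floats = [c / 100 for c in cents]                # like a DOUBLE column
decimals = [Decimal(c).scaleb(-2) for c in cents]  # like a DECIMAL column

float_sums = {sum_in_order(floats, seed) for seed in range(5)}
decimal_sums = {sum_in_order(decimals, seed) for seed in range(5)}

print(len(float_sums))    # typically more than 1: the total depends on the order
print(len(decimal_sums))  # 1: exact decimal addition gives one total
```

In Impala the analogous change would be to store the column as DECIMAL, or cast it to DECIMAL before summing, with a precision and scale chosen to fit your data; the right values depend on the table and are not something this sketch can tell you.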
Upvotes: 2