Reputation: 13088
I've encountered some very strange sorting behaviour in Scala (2.11.8 and 2.12.1) for Seq[(Long, Double)]. I am probably misunderstanding something fundamental.
Given a sequence with no Double.NaN values in it, everything works as expected:
Seq[(Long, Double)]((1L, 2.5D), (2L, 0D), (11L, 11D), (2L, 10D)).sortWith(_._2 > _._2)
output >>> Seq[(Long, Double)] = List((11,11.0), (2,10.0), (1,2.5), (2,0.0))
If I add a tuple with a NaN in the sort column, strange things happen:
Seq[(Long, Double)]((1L, 2.5D), (2L, 0D), (5L, Double.NaN), (11L, 11D), (2L, 10D)).sortWith(_._2 > _._2)
output >>> Seq[(Long, Double)] = List((1,2.5), (2,0.0), (5,NaN), (11,11.0), (2,10.0))
so it looks like nothing was done. If I swap the first two elements:
Seq[(Long, Double)]((2L, 0D), (1L, 2.5D), (5L, Double.NaN), (11L, 11D), (2L, 10D)).sortWith(_._2 > _._2)
output >>> Seq[(Long, Double)] = List((11,11.0), (2,10.0), (1,2.5), (2,0.0), (5,NaN))
the sorting works again??
The same is observed with just Seq[Double]:
Seq[Double](2.5, 0, Double.NaN, 11, 10).sortWith(_ > _)
output >>> Seq[Double] = List(2.5, 0.0, NaN, 11.0, 10.0)
Seq[Double](0, 2.5, Double.NaN, 11, 10).sortWith(_ > _)
output >>> Seq[Double] = List(11.0, 10.0, 2.5, 0.0, NaN)
.sortBy(_._2) seems to work in all cases. Is this a bug in Scala, or in my brain? I'm using Scala 2.11.8 and 2.12.1 on Ubuntu 16.04 with Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91.
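To make the contrast concrete, here is a small self-contained sketch (the variable names are mine): sortBy falls back on the default Ordering[Double], which is total and sorts NaN after every other value, while sortWith(_ > _) relies on a comparison that stops being total once NaN is present.

```scala
// Sketch: contrast the total default Ordering[Double] (NaN sorts last)
// with the non-total comparison used by sortWith(_ > _).
val xs = Seq[Double](2.5, 0, Double.NaN, 11, 10)

// The default Ordering[Double] delegates to java.lang.Double.compare, a
// total order, so the result is deterministic: NaN always ends up last.
xs.sortBy(identity)   // List(0.0, 2.5, 10.0, 11.0, NaN)

// Non-total comparison: the result depends on the input order.
xs.sortWith(_ > _)
xs.reverse.sortWith(_ > _)
```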
Update
It turns out that if I reverse the sort order, slightly more predictable things happen:
Seq[(Long, Double)]((1L, 2.5D), (2L, 0D), (3L, Double.NaN), (11L, 11D)).sortWith(_._2 < _._2)
output >>> Seq[(Long, Double)] = List((2,0.0), (1,2.5), (3,NaN), (11,11.0))
but again, adding a NaN in between breaks the sort:
Seq[(Long, Double)]((2L, 0D), (3L, Double.NaN), (1L, 2.5D), (11L, 11D)).sortWith(_._2 < _._2)
output >>> Seq[(Long, Double)] = List((1,2.5), (3,NaN), (2,0.0), (11,11.0))
So Seq.sortWith or .sortBy just decides to give up when it sees a NaN?
On a slightly unrelated note, I came across this because Spark was throwing the error below when I was trying to sort a sequence of tuples like the above. The results above are just from the Scala REPL though, no Spark involved there.
java.lang.IllegalArgumentException: Comparison method violates its general contract!
There's also a related question on .max and .min with NaNs: min/max of collections containing NaN (handling incomparability in ordering)
Upvotes: 3
Views: 842
Reputation: 37832
The Spark exception is actually a really good hint as to what's going on here: Spark assumes the comparison method defines a Total Order on the input set, which means, among other things, that for every two distinct elements in the set:
A > B OR B > A
Now, the function _ > _ is a Total Order over the ordinary Doubles, but it is not total over the set of all Doubles including NaN. This can be seen with these simple tests:
scala> Double.NaN > 1.0
res19: Boolean = false
scala> Double.NaN < 1.0
res20: Boolean = false
So, NaN is neither greater nor smaller than 1.0.
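That failed trichotomy can be written down as a one-line property check (a sketch; the helper names are mine): for two distinct values a and b, exactly one of lt(a, b) and lt(b, a) must hold, and NaN violates this.

```scala
// Totality check for the comparison behind sortWith(_ > _):
// for distinct a and b, exactly one of lt(a, b) / lt(b, a) must hold.
// XOR captures "exactly one".
val lt: (Double, Double) => Boolean = _ > _

def exactlyOne(a: Double, b: Double): Boolean = lt(a, b) ^ lt(b, a)

exactlyOne(2.5, 1.0)         // true: ordinary doubles compare fine
exactlyOne(Double.NaN, 1.0)  // false: neither direction holds for NaN
```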
Back to the implementation of sortWith: I didn't check, but I'm assuming it makes the same totality assumption; instead of throwing an error when the contract is violated, its results simply become unpredictable (order-dependent).
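As a practical aside (my sketch, not part of the original answer): if the descending sort needs to behave deterministically in the presence of NaN, the sortWith predicate can be based on java.lang.Double.compare, which is total and treats NaN as greater than every other double.

```scala
// Hypothetical workaround: drive sortWith with java.lang.Double.compare,
// which defines a total order (NaN compares greater than everything else).
val data = Seq[(Long, Double)]((1L, 2.5D), (2L, 0D), (5L, Double.NaN), (11L, 11D), (2L, 10D))

val sortedDesc = data.sortWith((a, b) => java.lang.Double.compare(a._2, b._2) > 0)
// NaN now consistently sorts first in descending order:
// List((5,NaN), (11,11.0), (2,10.0), (1,2.5), (2,0.0))
```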
EDIT:
So Seq.sortWith or .sortBy just decides to give up when it sees a NaN?
No, it simply returns "wrong" results because it makes wrong decisions when its assumptions aren't upheld. There's no test looking for NaNs and giving up (as @TheArchetypalPaul commented: "Having sort deal with that would require it to check every value wasn't a NaN before comparison. Not worth it").
Upvotes: 5