Matti Lyra

Reputation: 13088

scala Seq sortWith or sortBy with NaNs

I've encountered some very strange sorting behaviour in Scala (2.11.8 and 2.12.1) for Seq[(Long, Double)]. I am probably misunderstanding something fundamental.

Given a sequence with no Double.NaN values in it, everything works as expected:

Seq[(Long, Double)]((1L, 2.5D), (2L, 0D), (11L, 11D), (2L, 10D)).sortWith(_._2 > _._2)
output >>> Seq[(Long, Double)] = List((11,11.0), (2,10.0), (1,2.5), (2,0.0))

If I add a tuple with a NaN in the sort column, strange things happen:

Seq[(Long, Double)]((1L, 2.5D), (2L, 0D), (5L, Double.NaN), (11L, 11D), (2L, 10D)).sortWith(_._2 > _._2)
output >>> Seq[(Long, Double)] = List((1,2.5), (2,0.0), (5,NaN), (11,11.0), (2,10.0))

so it looks like nothing was done

If I swap the first two elements:

Seq[(Long, Double)]((2L, 0D), (1L, 2.5D), (5L, Double.NaN), (11L, 11D), (2L, 10D)).sortWith(_._2 > _._2)
output >>> Seq[(Long, Double)] = List((11,11.0), (2,10.0), (1,2.5), (2,0.0), (5,NaN))

the sorting works again??

The same behaviour is observed with a plain Seq[Double]:

Seq[Double](2.5, 0, Double.NaN, 11, 10).sortWith(_ > _)
output >>> Seq[Double] = List(2.5, 0.0, NaN, 11.0, 10.0)

Seq[Double](0, 2.5, Double.NaN, 11, 10).sortWith(_ > _)
output >>> Seq[Double] = List(11.0, 10.0, 2.5, 0.0, NaN)

.sortBy(_._2) seems to work in all cases. Is this a bug in Scala, or in my brain? I'm using Scala 2.11.8 and 2.12.1 on Ubuntu 16.04 with the Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91.
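For what it's worth, .sortBy behaving consistently here is explainable if (as I understand the 2.11/2.12 sources, so treat this as an assumption) the implicit Ordering[Double] delegates to java.lang.Double.compare, which defines a total order that sorts NaN after every other value:

```scala
// Sketch (assuming 2.11/2.12 internals): the implicit Ordering[Double]
// delegates to java.lang.Double.compare, a *total* order in which NaN
// sorts after every other value -- which is why .sortBy stays deterministic.
object SortByCheck extends App {
  println(java.lang.Double.compare(Double.NaN, 1.0)) // 1: NaN counts as "greater"
  println(java.lang.Double.compare(1.0, Double.NaN)) // -1

  val xs = Seq[(Long, Double)]((1L, 2.5), (2L, 0.0), (5L, Double.NaN), (11L, 11.0), (2L, 10.0))
  println(xs.sortBy(_._2)) // NaN consistently lands last
}
```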

update

It turns out that if I reverse the sort order, slightly more predictable things happen:

Seq[(Long, Double)]((1L, 2.5D), (2L, 0D), (3L, Double.NaN), (11L, 11D)).sortWith(_._2 < _._2)
output >>> Seq[(Long, Double)] = List((2,0.0), (1,2.5), (3,NaN), (11,11.0))

but again adding a NaN in between breaks the sort

Seq[(Long, Double)]((2L, 0D), (3L, Double.NaN), (1L, 2.5D), (11L, 11D)).sortWith(_._2 < _._2)
output >>> Seq[(Long, Double)] = List((1,2.5), (3,NaN), (2,0.0), (11,11.0))

So Seq.sortWith or .sortBy just decides to give up when it sees a NaN?


On a slightly unrelated note, I came across this because Spark was throwing the error below when I tried to sort a sequence of tuples like the ones above. The results above are from the plain Scala REPL though; no Spark involved there.

java.lang.IllegalArgumentException: Comparison method violates its general contract!


There's also a related question on .max and .min with NaNs: min/max of collections containing NaN (handling incomparability in ordering)

Upvotes: 3

Views: 842

Answers (1)

Tzach Zohar

Reputation: 37832

The Spark exception is actually a really good hint as to what's going on here: Spark assumes the comparison method defines a total order on the input set, which means, among other things, that for every two distinct elements A and B in the set:

A > B OR B > A

Now, the function _ > _ is a total order over the ordinary Doubles, but it loses totality once NaN is included. This can be seen with two simple tests:

scala> Double.NaN > 1.0
res19: Boolean = false

scala> Double.NaN < 1.0
res20: Boolean = false

So NaN is neither greater than nor smaller than 1.0 (nor equal to it; even NaN == NaN is false).
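Put differently, sortWith needs at least a strict weak order: the relation "neither a before b nor b before a" must behave as a transitive equivalence. A quick sketch shows that NaN breaks exactly this:

```scala
// Sketch: under a strict weak order, "neither gt(a, b) nor gt(b, a)" must be
// a transitive equivalence. NaN violates that: it is "equivalent" to both
// 1.0 and 11.0, yet 1.0 and 11.0 are clearly not equivalent to each other.
val gt: (Double, Double) => Boolean = _ > _
def equiv(a: Double, b: Double): Boolean = !gt(a, b) && !gt(b, a)

println(equiv(Double.NaN, 1.0))  // true
println(equiv(Double.NaN, 11.0)) // true
println(equiv(1.0, 11.0))        // false -- transitivity of equivalence broken
```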

Back to the implementation of sortWith: I didn't check, but I assume it makes the same assumption; instead of throwing an error when the assumption fails, its results simply become unpredictable (order-dependent) when totality is broken.

EDIT:

So Seq.sortWith or .sortBy just decides to give up when it sees a NaN?

No, it simply returns "wrong" results because it makes wrong decisions due to the assumptions not being upheld. There's no test looking for NaNs and giving up (as @TheArchetypalPaul commented: "Having sort deal with that would require it to check every value wasn't a NaN before comparison. Not worth it").
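If filtering the NaNs out up front isn't an option, one possible workaround (a sketch, not necessarily the canonical fix) is to restore totality by delegating the comparison to java.lang.Double.compare, under which NaN consistently sorts as the largest value:

```scala
// Sketch: make the sortWith predicate total by routing it through
// java.lang.Double.compare; NaN then behaves as the largest value.
val xs = Seq[(Long, Double)]((1L, 2.5), (2L, 0.0), (5L, Double.NaN), (11L, 11.0), (2L, 10.0))

// Descending sort; NaN now deterministically sorts first regardless of input order.
val desc = xs.sortWith((a, b) => java.lang.Double.compare(a._2, b._2) > 0)
println(desc) // List((5,NaN), (11,11.0), (2,10.0), (1,2.5), (2,0.0))
```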

Upvotes: 5
