How to calculate frequencies for sequences containing NaNs?

Question

The result of frequencies is wrong when used on sequences containing NaNs, for example:

=> (frequencies [Double/NaN Double/NaN])
{NaN 1, NaN 1}

instead of expected {NaN 2}.

Furthermore, the running time deteriorates from the expected/average O(n) to the worst-case O(n^2), e.g.

=> (def v3 (vec (repeatedly 1e3 #(Double/NaN))))
=> (def r (time (frequencies v3)))
"Elapsed time: 36.081751 msecs"
...
=> (def v3 (vec (repeatedly 1e4 #(Double/NaN))))
=> (def r (time (frequencies v3)))
"Elapsed time: 3358.490101 msecs"
...

i.e. 10 times as many elements take about 100 times as long.

How can frequencies be calculated with (expected/average) O(n) running time, when NaNs are present in the sequence?

As a side note:

 => (frequencies (repeat 1e3 Double/NaN))
 {NaN 1000}

yields the expected result, probably because all elements in the sequence are references to the same object.
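
To make the difference concrete, here is a minimal sketch. The names one-nan, shared-freqs and separate-freqs are made up for illustration. It assumes that a var boxes the NaN once, so both vector elements are the same Double object, while each Double/NaN written inside a vector literal is boxed separately:

```clojure
;; Hypothetical name: the var holds a single boxed Double,
;; so every use of one-nan refers to the same object.
(def one-nan Double/NaN)

;; The same object twice.
(def shared-freqs (frequencies [one-nan one-nan]))

;; Two separately boxed NaNs, as in the example at the top.
(def separate-freqs (frequencies [Double/NaN Double/NaN]))

;; Number of distinct keys in each result.
(println (count shared-freqs) (count separate-freqs))

;; Comparing NaN by value is false.
(println (= Double/NaN Double/NaN))
```

On Clojure 1.10 this should report one key for the shared reference and two keys for the separately boxed values, which matches the two results shown above.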

Alan Thompson · Accepted Answer

Some background on NaN values on the JVM: https://www.baeldung.com/java-not-a-number

You can solve this by encoding the NaN values temporarily while computing the frequencies:

(ns tst.demo.core
  (:use tupelo.core
        tupelo.test))

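;; Only a Double can be NaN here; the double? guard keeps other
;; values (longs, keywords, strings) from throwing on .isNaN.
(defn is-NaN? [x] (and (double? x) (.isNaN ^Double x)))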

(defn nan-encode
  [arg]
  (if (is-NaN? arg)
    ::nan
    arg))

(defn nan-decode
  [arg]
  (if (= ::nan arg)
    Double/NaN
    arg))

(defn freq-nan
  [coll]
  (it-> coll
    (mapv nan-encode it)
    (frequencies it)
    (map-keys it nan-decode)))

(dotest
  (let [x [1.0 2.0 2.0 Double/NaN Double/NaN Double/NaN]]
    (is= (spyx (freq-nan x)) {1.0   1,
                              2.0   2,
                              ##NaN 3})))

with result:

-------------------------------
   Clojure 1.10.1    Java 13
-------------------------------

Testing tst.demo.core

(freq-nan x) => {1.0 1, 2.0 2, ##NaN 3}

FAIL in (dotest-line-25) (core.clj:27)
expected: (clojure.core/= (spyx (freq-nan x)) {1.0 1, 2.0 2, ##NaN 3})
  actual: (not (clojure.core/= {1.0 1, 2.0 2, ##NaN 3} {1.0 1, 2.0 2, ##NaN 3}))

Note that even though it calculates and prints the correct result, the unit test still fails, since NaN is never equal to anything, not even itself. If you want the unit test to pass, you need to leave the placeholder ::nan in the result, like this:

(defn freq-nan
  [coll]
  (it-> coll
    (mapv nan-encode it)
    (frequencies it)))

(dotest
  (let [x [1.0 2.0 2.0 Double/NaN Double/NaN Double/NaN]]
    (is= (spyx (freq-nan x)) {1.0   1,
                              2.0   2,
                              ::nan 3})))
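
If you would rather not pull in Tupelo, the same placeholder idea can be sketched with clojure.core alone. The names nan-value? and freq-nan-plain are made up for this example, and the sketch only treats Double NaNs specially (Float NaNs are not handled):

```clojure
(defn nan-value?
  "True only for a Double that is NaN. Assumption: Float NaNs are ignored."
  [x]
  (and (double? x) (.isNaN ^Double x)))

(defn freq-nan-plain
  "Like frequencies, but counts every NaN under the placeholder ::nan."
  [coll]
  (frequencies (map #(if (nan-value? %) ::nan %) coll)))

(println (freq-nan-plain [1.0 2.0 2.0 Double/NaN Double/NaN Double/NaN]))
```

Because every NaN is replaced by the same keyword before counting, the map has a single key for all of them, and the result can be compared with = in a test.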
