Reputation: 34367
The result of frequencies is wrong when used for sequences containing NaNs, for example:
=> (frequencies [Double/NaN Double/NaN])
{NaN 1, NaN 1}
instead of the expected {NaN 2}.
Furthermore, the running time deteriorates from the expected (average-case) O(n) to worst-case O(n^2), e.g.
=> (def v3 (vec (repeatedly 1e3 #(Double/NaN))))
=> (def r (time (frequencies v3)))
"Elapsed time: 36.081751 msecs"
...
=> (def v4 (vec (repeatedly 1e4 #(Double/NaN))))
=> (def r (time (frequencies v4)))
"Elapsed time: 3358.490101 msecs"
...
i.e. 10 times as many elements take roughly 100 times as long.
How can frequencies be calculated with (expected/average) O(n) running time when NaNs are present in the sequence?
As a side note:
=> (frequencies (repeat 1e3 Double/NaN))
{NaN 1000}
yields the expected result, probably because all elements in the sequence are references to the same object.
Upvotes: 2
Views: 85
Reputation: 1516
NaN is pretty weird in many programming languages, partly because the IEEE 754 standard for floating point numbers defines that NaN should not equal anything, not even itself. It is the "not even itself" part that leads to most of the weird behavior you are seeing. More here, if you are curious: https://github.com/jafingerhut/batman
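A short REPL sketch of the "not even itself" behavior, assuming Clojure 1.9 or later for the ##NaN literal. The hash comparison is my own addition rather than something stated in the question; it suggests one plausible reason for the slowdown: every NaN lands in the same hash bucket yet never compares equal to the keys already there, so each insertion has to scan all earlier NaN keys.

```clojure
;; NaN never compares equal to NaN, not even to another NaN.
(= ##NaN ##NaN)                  ; false

;; Yet all NaN doubles produce the same hash value.
(= (hash ##NaN) (hash ##NaN))    ; true

;; Double/isNaN is the reliable way to detect NaN.
(Double/isNaN ##NaN)             ; true
```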
The sample function below may be adaptable to your needs. It uses :nan-kw in the returned map to indicate how many NaNs were found. If you replace :nan-kw with ##NaN, then the returned map has the disadvantage that you cannot find the count with (get frequency-ret-value ##NaN), because of the weirdness of ##NaN.
(defn frequencies-maybe-nans [s]
  (let [separate-nans (group-by #(and (double? %) (Double/isNaN %)) s)
        num-nans      (count (separate-nans true))]
    (merge (frequencies (separate-nans false))
           (when-not (zero? num-nans)
             {:nan-kw num-nans}))))
(def freqs (frequencies-maybe-nans [1 2 ##NaN 5 5]))
freqs
(get freqs 2)
(get freqs :nan-kw)
Upvotes: 4
Reputation: 29976
Some background on NaN values on the JVM: https://www.baeldung.com/java-not-a-number
You can solve this by encoding the NaN values temporarily while computing the frequencies:
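(ns tst.demo.core
  (:use tupelo.core
        tupelo.test))

;; Guard with double? so that non-Double elements do not throw on .isNaN
(defn is-NaN? [x] (and (double? x) (.isNaN ^Double x)))

;; Replace NaN with a placeholder keyword that hashes and compares normally
(defn nan-encode
  [arg]
  (if (is-NaN? arg)
    ::nan
    arg))

;; Map the placeholder back to NaN
(defn nan-decode
  [arg]
  (if (= ::nan arg)
    Double/NaN
    arg))

(defn freq-nan
  [coll]
  (it-> coll
    (mapv nan-encode it)
    (frequencies it)
    (map-keys it nan-decode)))

(dotest
  (let [x [1.0 2.0 2.0 Double/NaN Double/NaN Double/NaN]]
    (is= (spyx (freq-nan x)) {1.0   1,
                              2.0   2,
                              ##NaN 3})))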
with result:
-------------------------------
Clojure 1.10.1 Java 13
-------------------------------
Testing tst.demo.core
(freq-nan x) => {1.0 1, 2.0 2, ##NaN 3}
FAIL in (dotest-line-25) (core.clj:27)
expected: (clojure.core/= (spyx (freq-nan x)) {1.0 1, 2.0 2, ##NaN 3})
actual: (not (clojure.core/= {1.0 1, 2.0 2, ##NaN 3} {1.0 1, 2.0 2, ##NaN 3}))
Note that even though it calculates and prints the correct result, the unit test still fails, since NaN is never equal to anything, even itself. If you want the unit test to pass, you need to leave the placeholder ::nan in the result, like:
(defn freq-nan
  [coll]
  (it-> coll
    (mapv nan-encode it)
    (frequencies it)))

(dotest
  (let [x [1.0 2.0 2.0 Double/NaN Double/NaN Double/NaN]]
    (is= (spyx (freq-nan x)) {1.0   1,
                              2.0   2,
                              ::nan 3})))
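The same encode-then-count idea can be sketched with clojure.core alone, in case pulling in tupelo is not wanted. This is a minimal sketch, not a definitive implementation: the names nan-value? and freq-nan-plain are hypothetical, the plain :nan keyword stands in for the ::nan placeholder above, and double? assumes Clojure 1.9 or later.

```clojure
(defn nan-value?
  "Hypothetical helper: true only for a Double that is NaN."
  [x]
  (and (double? x) (.isNaN ^Double x)))

(defn freq-nan-plain
  "Like frequencies, but counts every NaN under the placeholder key :nan.
  The placeholder is an assumption; pick one that cannot occur in the data."
  [coll]
  (frequencies (map #(if (nan-value? %) :nan %) coll)))

(freq-nan-plain [1.0 2.0 2.0 ##NaN ##NaN ##NaN])
;; => {1.0 1, 2.0 2, :nan 3}
```

Because the keyword hashes and compares like any ordinary key, the count should stay within the usual expected O(n) behavior of frequencies, and (get result :nan) works as a normal lookup.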
Upvotes: 3