Reputation: 1094
I have a strange problem. To illustrate:
a <- c(3.099331946117620972814,
3.099331946117621860992)
> unique(a)
[1] 3.099331946117620972814 3.099331946117621860992
> table(a)
a
3.09933194611762
2
So unique() correctly recognises that the numbers are different after the 15th digit. table(), however, does not consider them different.
This may be expected behaviour but it is causing an error in some of my code as I need them both to agree:
times <- sort(unique(times))
k <- as.numeric(table(times))
times correctly holds the unique times. k is supposed to hold the number of occurrences of each time, but because of the above issue the counts come out wrong.
Does anyone have a suggestion for getting unique and table to agree (or another workaround)?
Upvotes: 1
Views: 91
Reputation: 160417
Trying to use unique or table on floating-point numbers is conceptually problematic from the computer's standpoint. This topic is strongly related to the R FAQ 7.31, with an excerpt:
The only numbers that can be represented exactly in R’s numeric type are integers and fractions whose denominator is a power of 2. All other numbers are internally rounded to (typically) 53 binary digits accuracy. As a result, two floating point numbers will not reliably be equal unless they have been computed by the same algorithm, and not always even then. For example,
R> a <- sqrt(2)
R> a * a == 2
[1] FALSE
R> a * a - 2
[1] 4.440892e-16
R> print(a * a, digits = 18)
[1] 2.00000000000000044
(Other examples exist; if you are curious, I encourage you to read more in that FAQ topic.)
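As a small illustration of the FAQ's point, here is a sketch that compares the same two values once exactly and once with a tolerance. It uses base R's all.equal(), which by default allows a relative difference of about 1.5e-8; the variable names exact and approx are just for illustration.

```r
# Exact vs. tolerant comparison of doubles, using the FAQ's sqrt(2) example.
a <- sqrt(2)

# Exact comparison: fails because a * a is off by about 4.4e-16.
exact <- a * a == 2

# Tolerant comparison: all.equal() returns TRUE or a description of the
# difference, so wrap it in isTRUE() to get a plain logical.
approx <- isTRUE(all.equal(a * a, 2))

print(c(exact = exact, approx = approx))
# exact is FALSE, approx is TRUE
```

The same idea applies to uniqueness: before asking "are these the same value?", you first have to decide how close counts as the same.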
Because of this, I suggest you decide on a required precision, then use exactly those digits when looking for uniqueness. Using your numbers, you can force the issue with format (and sprintf):
a <- c(3.099331946117620972814,
3.099331946117621860992)
table(format(a, digits = 15))
# 3.09933194611762
# 2
table(format(a, digits = 16))
# 3.099331946117621 3.099331946117622
# 1 1
unique(format(a, digits = 15))
# [1] "3.09933194611762"
unique(format(a, digits = 16))
# [1] "3.099331946117621" "3.099331946117622"
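Applied to the two lines from the question, one possible sketch is to round once to the chosen precision and then feed the same rounded vector to both unique and table. Note that this swaps in signif() rather than format() so that the values stay numeric and sort numerically; the sample data, the choice of 15 significant digits, and the names key and times_unique are assumptions for illustration only.

```r
# Hypothetical sample data: two nearly identical times plus a repeated 1.5.
times <- c(3.099331946117620972814, 3.099331946117621860992, 1.5, 1.5)

# Assumption: 15 significant digits is the precision that matters here.
key <- signif(times, digits = 15)

# Both calls now see the same rounded values, so they agree on the groups.
times_unique <- sort(unique(key))
k <- as.numeric(table(key))

# Sanity check: one count per unique time.
stopifnot(length(times_unique) == length(k))
```

With this sample data, the two near-duplicates collapse into one time, so times_unique has two entries and each has a count of 2. If your data need more or fewer digits, change the digits argument, but keep it at 15 or below so that the labels table builds internally do not merge values that unique still sees as distinct.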
For the curious, the reason unique and table differ is rooted somewhere in table's use of factor, which in turn uses as.character(y). If you call as.character(a), it keeps only 15 significant digits (14 decimal places here):
as.character(a)
# [1] "3.09933194611762" "3.09933194611762"
So to answer the question you asked: unique and table are different because table ultimately uses as.character, which by default keeps only 15 significant digits (14 decimal places here). (Since it's a primitive, you'll have to go into the low-level source to figure that one out.)
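To see the mechanism for yourself, here is a short sketch that counts the distinct values at each stage, using the two numbers from the question. The variable names n_unique, n_labels and n_levels are made up for this example.

```r
a <- c(3.099331946117620972814,
       3.099331946117621860992)

# unique() compares the stored doubles directly.
n_unique <- length(unique(a))

# as.character() produces the text labels that factor() works with.
n_labels <- length(unique(as.character(a)))

# factor() (and therefore table()) groups by those labels.
n_levels <- nlevels(factor(a))

print(c(unique = n_unique, labels = n_labels, levels = n_levels))
```

If the label and level counts come out smaller than the unique count, the values were merged during the conversion to character, not by table's counting itself.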
The first part of my answer addresses the underlying assumption that using unique on floating-point numbers is a good thing to do (I argue that it is not).
Upvotes: 3