Reputation: 61
I have a data set that consists of 3.5e6 1's, 7.5e6 0's, and 4.4e6 NA's. When I call summary()
on it, I get a mean and maximum that are wrong (in disagreement with mean()
and max()
).
> summary(data, digits = 10)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 0 1 1 1 1 4365239
When mean()
is called separately, it returns a reasonable value:
> mean(data, na.rm = T)
[1] 0.6804823
It looks like this problem is generic to any vector with more than 3162277 NA values in it.
With just under the cutoff:
> thingie <- as.numeric(c(rep(0,1e6), rep(1,1e6), rep(NA,3162277)))
> summary(thingie)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0 0.0 0.5 0.5 1.0 1.0 3162277
And just over:
> thingie <- as.numeric(c(rep(0,1e6), rep(1,1e6), rep(NA,3162278)))
> summary(thingie)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 0 0 0 1 1 3162278
It doesn't seem to matter how many non-missing values there are either.
> thingie <- as.numeric(c(rep(0,1), rep(1,1), rep(NA,3162277)))
> summary(thingie)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0 0.2 0.5 0.5 0.8 1.0 3162277
> thingie <- as.numeric(c(rep(0,1), rep(1,1), rep(NA,3162278)))
> summary(thingie)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 0 0 0 1 1 3162278
Clearly, this isn't a critical problem because the mean()
and max()
functions can be used instead of summary()
, but I'm curious if anyone knows what causes this behavior. Also, neither my sister nor I could find any mention of it, so I figured I'd document it for posterity.
EDIT: I said mean and max the whole post but the max is fine. 1st quantile, median, and 3rd quantile differ.
Upvotes: 6
Views: 700
Reputation: 191
An update to @thelatemail 's answer:
R (version 4.1.2, at least) will do this for many fewer than 3162277 NA's if your other values are small. For instance, with only 10 NA's,
bar <- c(1:10/100, rep(x=NA, times=10))
> summary(bar, digits=12)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0100 0.0325 0.0550 0.0550 0.0775 0.1000 10
> print.default(summary(bar), digits=12)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0100 0.0325 0.0550 0.0550 0.0775 0.1000 10.0000
summary()
gets it right. But with 10^4 NA's there's some bad rounding,
bar <- c(1:10/100, rep(x=NA, times=1E4))
> summary(bar, digits=12)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0 0.0 0.1 0.1 0.1 0.1 10000
> print.default(summary(bar), digits=12)
Min. 1st Qu. Median Mean 3rd Qu.
0.0100 0.0325 0.0550 0.0550 0.0775
Max. NA's
0.1000 10000.0000
and with 10^5 NA's everything's rounded to zero:
bar <- c(1:10/100, rep(x=NA, times=1E5))
> summary(bar, digits=12)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 0 0 0 0 0 100000
> print.default(summary(bar), digits=12)
Min. 1st Qu. Median Mean 3rd Qu.
0.0100 0.0325 0.0550 0.0550 0.0775
Max. NA's
0.1000 100000.0000
Upvotes: 0
Reputation: 93813
Here's some example data:
x <- rep(c(1,0,NA), c(3.5e6,7.5e6,4.4e6))
out <- summary(x)
out
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 0 0 0 0 1 1 4400000
mean(x, na.rm=TRUE)
#[1] 0.3181818
The issue can be traced back to zapsmall()
as it does some rounding in a line that essentially does:
c(out)
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 0.000e+00 0.000e+00 0.000e+00 3.182e-01 1.000e+00 1.000e+00 4.400e+06
round(c(out), max(0L, getOption("digits")-log10(4400000)))
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 0 0 0 0 1 1 4400000
The critical turning point here is 3162277
to 3162278
NA
values where it tips the rounding threshold from 0 to 1 as it goes across 0.5.
dput(max(0L,getOption("digits")-log10(3162277)))
#0.500000090664876
dput(max(0L,getOption("digits")-log10(3162278)))
#0.499999953328896
out[7] <- 3162277
out
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 0.0 0.0 0.0 0.3 1.0 1.0 3162277
out[7] <- 3162278
out
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 0 0 0 0 1 1 3162278
Upvotes: 4