Cathy
Cathy

Reputation: 61

R summary() gives incorrect values for too many NAs

The Setup

I have a data set that consists of 3.5e6 1's, 7.5e6 0's, and 4.4e6 NA's. When I call summary() on it, I get a mean and maximum that are wrong (in disagreement with mean() and max()).

> summary(data, digits = 10)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
 0       0       1       1       1       1 4365239 

When mean() is called separately, it returns a reasonable value:

> mean(data, na.rm = T)
[1] 0.6804823

Characterization of the problem

It looks like this problem is generic to any vector with more than 3162277 NA values in it.

With just under the cutoff:

> thingie <- as.numeric(c(rep(0,1e6), rep(1,1e6), rep(NA,3162277)))
> summary(thingie)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    0.0     0.0     0.5     0.5     1.0     1.0 3162277 

And just over:

> thingie <- as.numeric(c(rep(0,1e6), rep(1,1e6), rep(NA,3162278)))
> summary(thingie)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      0       0       0       0       1       1 3162278 

It doesn't seem to matter how many non-missing values there are either.

> thingie <- as.numeric(c(rep(0,1), rep(1,1), rep(NA,3162277)))
> summary(thingie)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    0.0     0.2     0.5     0.5     0.8     1.0 3162277 
> thingie <- as.numeric(c(rep(0,1), rep(1,1), rep(NA,3162278)))
> summary(thingie)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      0       0       0       0       1       1 3162278 

Research

  1. In searching for an answer, I came across the well-known rounding error, but that doesn't affect this behavior (see the first code chunk).
  2. I thought this might be some sort of bizarre quirk of my environment/machine/planetary alignment, so I had my sister run the same code. She got the same results on her machine.

Closing remarks

Clearly, this isn't a critical problem because the mean() and max() functions can be used instead of summary(), but I'm curious if anyone knows what causes this behavior. Also, neither my sister nor I could find any mention of it, so I figured I'd document it for posterity.

EDIT: I said mean and max the whole post but the max is fine. 1st quantile, median, and 3rd quantile differ.

Upvotes: 6

Views: 700

Answers (2)

eac2222
eac2222

Reputation: 191

An update to @thelatemail 's answer:

R (version 4.1.2, at least) will do this for many fewer than 3162277 NA's if your other values are small. For instance, with only 10 NA's,

bar <- c(1:10/100, rep(x=NA, times=10))
> summary(bar, digits=12)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0100  0.0325  0.0550  0.0550  0.0775  0.1000      10 
> print.default(summary(bar), digits=12) 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0100  0.0325  0.0550  0.0550  0.0775  0.1000 10.0000 

summary() gets it right. But with 10^4 NA's there's some bad rounding,

bar <- c(1:10/100, rep(x=NA, times=1E4))
> summary(bar, digits=12)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    0.0     0.0     0.1     0.1     0.1     0.1   10000 
> print.default(summary(bar), digits=12) 
      Min.    1st Qu.     Median       Mean    3rd Qu. 
    0.0100     0.0325     0.0550     0.0550     0.0775 
      Max.       NA's 
    0.1000 10000.0000

and with 10^5 NA's everything's rounded to zero:

bar <- c(1:10/100, rep(x=NA, times=1E5))
> summary(bar, digits=12)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      0       0       0       0       0       0  100000 
> print.default(summary(bar), digits=12) 
       Min.     1st Qu.      Median        Mean     3rd Qu. 
     0.0100      0.0325      0.0550      0.0550      0.0775 
       Max.        NA's 
     0.1000 100000.0000 

Upvotes: 0

thelatemail
thelatemail

Reputation: 93813

Here's some example data:

x <- rep(c(1,0,NA), c(3.5e6,7.5e6,4.4e6))
out <- summary(x)
out
# Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#    0       0       0       0       1       1 4400000

mean(x, na.rm=TRUE)
#[1] 0.3181818

The issue can be traced back to zapsmall() as it does some rounding in a line that essentially does:

c(out)
#      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
# 0.000e+00 0.000e+00 0.000e+00 3.182e-01 1.000e+00 1.000e+00 4.400e+06

round(c(out), max(0L, getOption("digits")-log10(4400000)))
# Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#    0       0       0       0       1       1 4400000 

The critical turning point here is 3162277 to 3162278 NA values where it tips the rounding threshold from 0 to 1 as it goes across 0.5.

dput(max(0L,getOption("digits")-log10(3162277)))
#0.500000090664876

dput(max(0L,getOption("digits")-log10(3162278)))
#0.499999953328896

out[7] <- 3162277
out
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#    0.0     0.0     0.0     0.3     1.0     1.0 3162277 

out[7] <- 3162278
out
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#      0       0       0       0       1       1 3162278

Upvotes: 4

Related Questions