R - Discrepancy in summary(data) and summary(data$variable)

Question

I have a data set with 61 observations and 2 variables. When I call summary() on the whole data set, the quartiles, median, mean and max of the second variable sometimes differ from what I get when I call summary() on the second variable alone. Why is that?

data <- read.csv("testdata.csv")

head(data)
#   Group.1     x
# 1 10/1/12     0
# 2 10/2/12   126
# 3 10/3/12 11352
# 4 10/4/12 12116
# 5 10/5/12 13294
# 6 10/6/12 15420

summary(data)
#   Group.1           x        
# 10/1/12 : 1   Min.   :    0  
# 10/10/12: 1   1st Qu.: 6778  
# 10/11/12: 1   Median :10395  
# 10/12/12: 1   Mean   : 9354  
# 10/13/12: 1   3rd Qu.:12811  
# 10/14/12: 1   Max.   :21194  
# (Other) :55             

summary(data[2])
#       x        
# Min.   :    0  
# 1st Qu.: 6778  
# Median :10395  
# Mean   : 9354  
# 3rd Qu.:12811  
# Max.   :21194  

# The following code yields a different result:

summary(data$x)
# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 0    6778   10400    9354   12810   21190

Mikko Marttila · Accepted Answer

@r2evans' comment is correct in that the discrepancy is caused by differences in summary.data.frame and summary.default.

The default value of digits for both methods is max(3L, getOption("digits") - 3L). If you haven't changed your options, this will evaluate to 4L. However, the two methods use their digits argument differently when formatting the output, which is the reason for the differences in the two methods' output. From ?summary:

digits: integer, used for number formatting with signif() (for summary.default) or format() (for summary.data.frame).
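
As a quick check of that default, the expression can be evaluated directly. This is a minimal sketch that assumes the "digits" option is at R's factory default of 7; it sets the option temporarily so the result does not depend on your session, and the variable names are only illustrative.

```r
# Assumption: R's factory default for the "digits" option is 7.
old <- options(digits = 7)

# Same expression the two summary methods use for their default `digits`
default_digits <- max(3L, getOption("digits") - 3L)
print(default_digits)  # 4

options(old)  # restore the previous setting
```

If you have changed options(digits = ...) in your session or .Rprofile, the default passed to both methods changes with it.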

Say we have the vector of summary statistics for x from the question:

q <- append(quantile(data$x), mean(data$x), after = 3L)
q
##   0%      25%      50%               75%     100% 
## 0.00  6778.00 10395.00  9354.23 12811.00 21194.00

In summary.default the output is formatted using signif, which rounds its input to the supplied number of significant digits:

signif(q, digits = 4L)
## 0%   25%   50%         75%  100% 
##  0  6778 10400  9354 12810 21190

summary.data.frame, on the other hand, uses format, which treats its digits argument only as a suggestion (see ?format) for the number of significant digits to display:

format(q, digits = 4L)
##      0%     25%     50%             75%    100% 
## "    0" " 6778" "10395" " 9354" "12811" "21194"

Thus, with the default digits value of 4, summary.default(data$x) rounds the 5-digit quantiles to only 4 significant digits, whereas summary.data.frame(data[2]) displays the 5-digit quantiles without rounding.
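
The contrast can be reproduced without the original CSV. The sketch below calls signif() and format() directly on three of the 5-digit values from the question, rather than going through summary(), since the exact printed output of summary() may depend on your R version. The variable names are only illustrative.

```r
# Three of the 5-digit statistics from the question (median, 3rd quartile, max)
x <- c(10395, 12811, 21194)

# signif() really rounds the numbers to 4 significant digits
rounded <- signif(x, digits = 4L)     # 10400 12810 21190

# format() never drops integer digits; `digits` is only a minimum suggestion
displayed <- format(x, digits = 4L)   # "10395" "12811" "21194"

# Side-by-side view of the original value and both treatments
print(data.frame(value = x, signif = rounded, format = displayed))
```

Note that signif() returns numbers while format() returns character strings, so only the former changes the underlying values.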

If you explicitly supply the digits argument as larger than 4, you'll get identical results:

summary(data[2], digits = 5L)
##        x          
## Min.   :    0.0  
## 1st Qu.: 6778.0  
## Median :10395.0  
## Mean   : 9354.2  
## 3rd Qu.:12811.0  
## Max.   :21194.0  

summary(data$x, digits = 5L)
##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.0  6778.0 10395.0  9354.2 12811.0 21194.0

As an extreme example of the differences of the two methods with the default digits:

df <- data.frame(a = 1e5 + 0:100)

summary(df$a)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  100000  100000  100000  100000  100100  100100 

summary(df)
##       a         
## Min.   :100000  
## 1st Qu.:100025  
## Median :100050  
## Mean   :100050  
## 3rd Qu.:100075  
## Max.   :100100
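
If you need the statistics themselves rather than a formatted display, one option is to compute them directly so that neither method's formatting is involved. This is a minimal sketch using the same example data frame; the name stats is only illustrative.

```r
df <- data.frame(a = 1e5 + 0:100)

# quantile() and mean() return the unrounded values;
# any rounding seen in summary() output is applied on top of these
stats <- c(quantile(df$a), Mean = mean(df$a))

# Print with enough significant digits to show every value in full
print(stats, digits = 7)
```

The quartiles here are 100000, 100025, 100050, 100075 and 100100, and the mean is 100050, which matches the summary(df) output above.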
