Reputation: 23
I have a data set with 61 observations and 2 variables. When I call summary() on the whole data set, the quantiles, median, mean and max of the second variable sometimes differ from what I get when I call summary() on the second variable alone. Why is that?
data <- read.csv("testdata.csv")
head(data)
# Group.1 x
# 1 10/1/12 0
# 2 10/2/12 126
# 3 10/3/12 11352
# 4 10/4/12 12116
# 5 10/5/12 13294
# 6 10/6/12 15420
summary(data)
# Group.1 x
# 10/1/12 : 1 Min. : 0
# 10/10/12: 1 1st Qu.: 6778
# 10/11/12: 1 Median :10395
# 10/12/12: 1 Mean : 9354
# 10/13/12: 1 3rd Qu.:12811
# 10/14/12: 1 Max. :21194
# (Other) :55
summary(data[2])
# x
# Min. : 0
# 1st Qu.: 6778
# Median :10395
# Mean : 9354
# 3rd Qu.:12811
# Max. :21194
# The following code yields a different result:
summary(data$x)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0 6778 10400 9354 12810 21190
Upvotes: 2
Views: 276
Reputation: 11878
@r2evans' comment is correct: the discrepancy is caused by differences between summary.data.frame and summary.default.
The default value of digits for both methods is max(3L, getOption("digits") - 3L). If you haven't changed your options, this evaluates to 4L. However, the two methods use their digits argument differently when formatting the output, which is why their output differs. From ?summary:
digits: integer, used for number formatting with signif() (for summary.default) or format() (for summary.data.frame).
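As a quick check of that default, here is a minimal sketch. It assumes the digits option is at R's factory setting of 7, which the snippet pins explicitly so the result is reproducible; the variable name default_digits is only illustrative.

```r
# Assumption: pin the global option to R's factory default of 7.
old <- options(digits = 7)

# Same expression both summary methods use for their default digits argument.
default_digits <- max(3L, getOption("digits") - 3L)
print(default_digits)

# Restore whatever the option was before.
options(old)
```

If you have changed options(digits = ...) in your session, both methods pick up a different default, but they still apply it in different ways.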
Say we have the vector of x's summary statistics from the question:
q <- append(quantile(data$x), mean(data$x), after = 3L)
q
## 0% 25% 50% 75% 100%
## 0.00 6778.00 10395.00 9354.23 12811.00 21194.00
In summary.default the output is formatted using signif, which rounds its input to the supplied number of significant digits:
signif(q, digits = 4L)
## 0% 25% 50% 75% 100%
## 0 6778 10400 9354 12810 21190
While summary.data.frame uses format, which treats its digits argument as only a suggestion (see ?format) for the number of significant digits to display:
format(q, digits = 4L)
## 0% 25% 50% 75% 100%
## " 0" " 6778" "10395" " 9354" "12811" "21194"
Thus, with the default digits value of 4, summary.default(data$x) rounds the 5-digit quantiles to only 4 significant digits, but summary.data.frame(data[2]) displays the 5-digit quantiles without rounding.
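The same contrast shows up on a single number. The sketch below uses a made-up value (10395.678 is not from the question's data) with a 5-digit integer part: signif changes the value itself, while format only drops decimals and keeps every digit to the left of the decimal point.

```r
# Hypothetical value, chosen to have more than 4 digits before the decimal point.
x <- 10395.678

# signif() returns a number rounded to 4 significant digits.
rounded <- signif(x, digits = 4L)

# format() returns a string; digits = 4 is a minimum, so the integer part survives.
shown <- format(x, digits = 4L)

print(rounded)  # 10400
print(shown)    # "10396"
```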
If you explicitly supply a digits argument larger than 4, you'll get identical results:
summary(data[2], digits = 5L)
## x
## Min. : 0.0
## 1st Qu.: 6778.0
## Median :10395.0
## Mean : 9354.2
## 3rd Qu.:12811.0
## Max. :21194.0
summary(data$x, digits = 5L)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 6778.0 10395.0 9354.2 12811.0 21194.0
As an extreme example of how the two methods differ with the default digits:
df <- data.frame(a = 1e5 + 0:100)
summary(df$a)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 100000 100000 100000 100000 100100 100100
summary(df)
## a
## Min. :100000
## 1st Qu.:100025
## Median :100050
## Mean :100050
## 3rd Qu.:100075
## Max. :100100
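If you need the unrounded statistics regardless of how summary() formats them, one option is to compute them directly. This is only a sketch on the same vector as the example above; the names a and exact are illustrative.

```r
# Same values as df$a in the example above.
a <- 1e5 + 0:100

# Quantiles and mean computed directly, with no display rounding applied.
exact <- c(quantile(a), Mean = mean(a))
print(exact)

# Applying signif() by hand shows the 4-significant-digit rounding described above.
print(signif(exact, digits = 4L))
```

Because quantile() and mean() return plain numeric values, any rounding you see afterwards comes only from how you choose to print them.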
Upvotes: 1