Reputation: 1
In the first block of code, I am grouping by VAR1
and I get a mean for VAR2
based on the two categories of VAR1
. However, when I select only for the category of interest from VAR1
, in order to obtain a mean of people for VAR2
who fall into category 1 in VAR1
, I get a different result.
What am I doing wrong?
DF %>%
na.omit() %>%
group_by(VAR1) %>%
summarise(mean(VAR2))
# A tibble: 2 × 2
VAR1 `mean(VAR2)`
<dbl> <dbl>
1 0 12.1
2 1 11.6
> mean(DF[DF$VAR1 == 1, 'VAR2',],na.rm=TRUE)
[1] 11.95238
Upvotes: 0
Views: 59
Reputation: 2944
The discrepancy in the mean values of VAR1
in those cases probably comes from the presence of NAs in another column, as @DanY explained briefly. I will try to elaborate this through the following example. Suppose we have DF that consists of three columns, two of which come from mtcars
and another one is VAR1
. Suppose there are 15 rows with VAR1
of 0
.
library(dplyr)
DF <- mtcars %>% select(mpg, cyl)
DF$VAR1 <- c(rep(0, 15),rep(1,17))
DF
# mpg cyl VAR1
#Mazda RX4 21.0 6 0
#Mazda RX4 Wag 21.0 6 0
#Datsun 710 22.8 4 0
#Hornet 4 Drive 21.4 6 0
#Hornet Sportabout 18.7 8 0
#Valiant 18.1 6 0
#Duster 360 14.3 8 0
#Merc 240D 24.4 4 0
#Merc 230 22.8 4 0
#Merc 280 19.2 6 0
#Merc 280C 17.8 6 0
#Merc 450SE 16.4 8 0
#Merc 450SL 17.3 8 0
#Merc 450SLC 15.2 8 0
#Cadillac Fleetwood 10.4 8 0
#Lincoln Continental 10.4 8 1
#Chrysler Imperial 14.7 8 1
#Fiat 128 32.4 4 1
#Honda Civic 30.4 4 1
#Toyota Corolla 33.9 4 1
#Toyota Corona 21.5 4 1
#Dodge Challenger 15.5 8 1
#AMC Javelin 15.2 8 1
#Camaro Z28 13.3 8 1
#Pontiac Firebird 19.2 8 1
#Fiat X1-9 27.3 4 1
#Porsche 914-2 26.0 4 1
#Lotus Europa 30.4 4 1
#Ford Pantera L 15.8 8 1
#Ferrari Dino 19.7 6 1
#Maserati Bora 15.0 8 1
#Volvo 142E 21.4 4 1
Now, let's change three values of cyl
into NA
.
DF$cyl[3:5] <- NA
head(DF)
# mpg cyl VAR1
#Mazda RX4 21.0 6 0
#Mazda RX4 Wag 21.0 6 0
#Datsun 710 22.8 NA 0
#Hornet 4 Drive 21.4 NA 0
#Hornet Sportabout 18.7 NA 0
#Valiant 18.1 6 0
Now DF
still have 15 values of mpg
with VAR1
of 0
because neither mpg
nor VAR1
has NA
. Now we try to compare the mean of mpg
in the two cases you have compared:
DF %>% na.omit() %>% group_by(VAR1) %>% summarise(mean(mpg))
# A tibble: 2 x 2
# VAR1 `mean(mpg)`
# <dbl> <dbl>
#1 0 18.2
#2 1 21.3
mean(DF[DF$VAR1 == 0, 'mpg',], na.rm = TRUE)
#[1] 18.72
mean(DF[DF$VAR1 == 1, 'mpg',], na.rm = TRUE)
#[1] 21.3
Clearly you got different mean values. It is because in DF %>% na.omit()
operates on DF
so any NA value in any row in DF
will cause omission of that row. That's why the three rows in cyl
that contain NA
s are omitted so that mpg
column has only 12 values, of which the mean is then computed.
DF %>% na.omit()
# mpg cyl VAR1
#Mazda RX4 21.0 6 0
#Mazda RX4 Wag 21.0 6 0
#Valiant 18.1 6 0
#Duster 360 14.3 8 0
#Merc 240D 24.4 4 0
#Merc 230 22.8 4 0
#Merc 280 19.2 6 0
#Merc 280C 17.8 6 0
#Merc 450SE 16.4 8 0
#Merc 450SL 17.3 8 0
#Merc 450SLC 15.2 8 0
#Cadillac Fleetwood 10.4 8 0
#Lincoln Continental 10.4 8 1
#Chrysler Imperial 14.7 8 1
#Fiat 128 32.4 4 1
#Honda Civic 30.4 4 1
#Toyota Corolla 33.9 4 1
#Toyota Corona 21.5 4 1
#Dodge Challenger 15.5 8 1
#AMC Javelin 15.2 8 1
#Camaro Z28 13.3 8 1
#Pontiac Firebird 19.2 8 1
#Fiat X1-9 27.3 4 1
#Porsche 914-2 26.0 4 1
#Lotus Europa 30.4 4 1
#Ford Pantera L 15.8 8 1
#Ferrari Dino 19.7 6 1
#Maserati Bora 15.0 8 1
#Volvo 142E 21.4 4 1
DF %>% na.omit() %>% filter(VAR1 == 0) %>% pull(mpg) %>% mean
# [1] 18.15833
On the other hand, DF[DF$VAR1 == 0, 'mpg',]
operates only on mpg
column, and has nothing to do with cyl
. That's why it has 15 values of mpg
with no NA
so na.rm
will not omit anything.
DF[DF$VAR1 == 0, 'mpg',]
#[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4
#[13] 17.3 15.2 10.4
mean(DF[DF$VAR1 == 0, 'mpg',], na.rm = TRUE)
#[1] 18.72
Upvotes: 1