Reputation: 11
I'm reproducing a question that I couldn't find an answer to.
"I got some surprising results when using the svytotal
routine from the survey package with data containing missing values.
Some example code demonstrating the behaviour is included below.
I have a stratified sampling design where I want to estimate the total
income. In some strata some of the incomes are missing. I want to
ignore these missing incomes. I would have expected that
svytotal(\~income, design=mydesign, na.rm=TRUE)
would do the trick.
However, when calculating the estimates 'by hand' the estimates were
different from those obtained from svytotal
. The estimated mean
incomes do agree with each other. It seems that using the na.rm option
with svytotal
is the same as replacing the missing values with zero's,
which is not what I would have expected, especially since this
behaviour seems to differ from that of svymean
. Is there a reason for
this behaviour?
I can of course remove the missing values myself before creating the survey object. However, with many different variables with different missing values, this is not very practical. Is there an easy way to get the behaviour I want?"
library(survey)
library(plyr)
# generate some data
data <- data.frame(
id = 1:20,
stratum = rep(c("a", "b"), each=10),
income = rnorm(20, 100),
n = rep(c(100, 200), each=10)
)
data$income[5] <- NA
# calculate mean and total income for every stratum using survey package
des <- svydesign(ids=~id, strata=~stratum, data=data, fpc=~n)
svyby(~income, by=~stratum, FUN=svytotal, design=des, na.rm=TRUE)
mn <- svyby(~income, by=~stratum, FUN=svymean, design=des, na.rm=TRUE)
mn
n <- svyby(~n, by=~stratum, FUN=svymean, design=des)
# total does not equal mean times number of persons in stratum
mn[2] * n[2]
# calculate mean and total income 'by hand'. This does not give the same total
# as svytotal, but it does give the same mean
ddply(data, .(stratum), function(d) {
data.frame(
mean = mean(d$income, na.rm=TRUE),
n = mean(d$n),
total = mean(d$income, na.rm=TRUE) * mean(d$n)
)
})
# when we set income to 0 for missing cases and repeat the previous estimation
# we get the same answer as svytotal (but not svymean)
data2 <- data
data2$income[is.na(data$income )] <- 0
ddply(data2, .(stratum), function(d) {
data.frame(
mean = mean(d$income, na.rm=TRUE),
n = mean(d$n),
total = mean(d$income, na.rm=TRUE) * mean(d$n)
)
})
Upvotes: 1
Views: 1185
Reputation: 2765
Yes, there is a reason for this behaviour!
The easiest way to think about the answer survey
is trying to give here is it sets the weights for the missing observations to zero. That is, the package gives population estimates for the subdomain of non-missing values. This is important for getting the right standard errors. [Note: it doesn't actually do it by just setting the weights to zero, there are some optimisations, but that's the answer it gives]
If you set the weights to zero in svytotal
, you get the sum of the non-missing values, which is the same as you get if you set the missing values to 0 or if they weren't ever sampled. When you come to compute standard errors it matters exactly which one you did, but not for point estimates.
If you set the weights to zero in svymean
you get the mean of the non-missing values, which is not the same as you get if you set the missing values to zero (though it is the same as if they just weren't ever sampled).
I don't know exactly what you mean when you say you want to 'ignore' the missing incomes, but if you want to multiply mn[2]
and n[2]
meaningfully, they need to be computed on the same subdomain: you have one of them computed only where income
is not missing and the other computed on all observations.
Upvotes: 1