Reputation: 55
I need to find a way to sample groups so that the observed proportions match the expected ones. I would like to keep as many of the observations in each group as possible.
Simple example: Group A = 302 (60.3%) Group B = 199 (39.7%)
The proportions I expect are 46.6% and 53.4%, so in this instance I would keep all the observations in Group B and sample Group A by 0.576 to get 174 observations. Is this correct?
Is there any way to write a rule in SAS or R that would give the appropriate sampling rate for n groups? My actual problem involves 14 groups with counts ranging from 2 to 77:
Group A = 77 , observed = 21.51%, expected = 15.10%
Group B = 5 , observed = 1.4%, expected = 0.54%
Group C = 2, observed = 0.56%, expected = 1.62%
etc.
Many thanks for your help.
Upvotes: 3
Views: 890
Reputation: 12819
I assume you're drawing a simple random sample (SRS) of your dataset. In that case, some underrepresentation or overrepresentation of the groups is to be expected. As far as inference goes, this is not an issue. If you're drawing the sample to estimate some population characteristic, say a total or a proportion, then you don't have to worry if the frequencies in the sample do not match those in the population. As a matter of fact, these frequencies are only equal on average, i.e., across all possible samples. This is already "taken into account" by the usual estimators.
On the other hand, it is possible to force the frequencies to match, but then we enter the Realm of Complex Samples. Some nice authors in this area are Särndal et al. (1992) and Tillé (2006). Some googling will show you how widespread their work is. In your practical case, I believe you're looking for a stratified sample, i.e., a sample formed by subsamples drawn within population groups. If you draw simple random samples within each group, it is straightforward to implement a routine in R with no more than 10 lines of code.
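A minimal sketch of such a routine, assuming the data sit in a data frame with one row per observation and a column of group labels. The names `dat`, `group`, `n_target` and `stratified_srs` are made up for illustration; the target sizes 174 and 199 are the ones from the simple example in the question.

```r
# Hypothetical data: one row per observation, with a group label.
set.seed(1)  # only so the illustration is reproducible
dat <- data.frame(id = 1:501,
                  group = rep(c("A", "B"), times = c(302, 199)))

# Hypothetical target sizes: how many units to keep in each stratum.
n_target <- c(A = 174, B = 199)

# Draw a simple random sample without replacement inside each group.
stratified_srs <- function(data, group_col, sizes) {
  rows <- unlist(lapply(names(sizes), function(g) {
    idx <- which(data[[group_col]] == g)
    # index via sample.int() to avoid the sample() pitfall when idx has length 1
    idx[sample.int(length(idx), sizes[[g]])]
  }))
  data[sort(rows), , drop = FALSE]
}

samp <- stratified_srs(dat, "group", n_target)
table(samp$group)
```

The function errors if a requested size exceeds the number of rows available in a group, which is probably the behaviour you want here.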
But if you want something ready, check out the "sampling" package for R: http://cran.r-project.org/web/packages/sampling/index.html
Beware that if you choose the complex samples approach, you must be extra careful, because this is a theory with many subtleties. The estimators take different forms (google, for instance, the "Horvitz-Thompson estimator"), their sampling distribution is much harder to describe, and a normal approximation to this distribution is often very rough.
Just to mention one of the subtleties involved: in the case of the stratified sample, consider the problem of determining how many sampling units should be allocated to each stratum (population group), given that the sample must have a fixed total number of units. Proportional allocation (i.e., matching the proportions of the groups in the sample and in the population) is not necessarily the best solution. See Cochran (1977) for a brief discussion or the books mentioned above for more details.
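To make the allocation point concrete, here is a small sketch comparing proportional allocation with one standard alternative, Neyman allocation, which sets the stratum sample sizes proportional to N_h * S_h (stratum size times within-stratum standard deviation). The stratum sizes, standard deviations and total sample size below are invented purely for illustration, and equal sampling costs across strata are assumed.

```r
# Hypothetical strata: population sizes N_h and within-stratum std. deviations S_h.
N_h <- c(A = 500, B = 300, C = 200)
S_h <- c(A = 2, B = 10, C = 5)
n   <- 100  # fixed total sample size (assumed)

# Proportional allocation: sample shares equal population shares.
n_prop <- n * N_h / sum(N_h)

# Neyman allocation: shares proportional to N_h * S_h.
n_neyman <- n * N_h * S_h / sum(N_h * S_h)

round(rbind(proportional = n_prop, neyman = n_neyman), 1)
```

With these made-up numbers, the highly variable stratum B gets 60 units under Neyman allocation instead of the 30 it gets under proportional allocation.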
Upvotes: 1
Reputation: 7602
I believe you can use PROC SURVEYSELECT to achieve this. You need to store the expected sampling rate per group in a separate dataset, then apply the option "SAMPRATE=SAS-data-set" in the PROC SURVEYSELECT statement. See the online documentation on this procedure for more information.
Upvotes: 2
Reputation: 93813
Here is a dodgy little function to play with:
minsamp <- function(obs, expect) {
  ## obs / expect is the largest total sample each group could support
  ## on its own; the group with the smallest value is the limiting one
  limit <- min(obs / expect)
  ## scale the expected proportions by that total to get the final counts
  round(expect * limit)
}
And an example (based on your simple example):
obs <- c(a=302,b=199)
expect <- c(a=0.466,b=0.534)
> minsamp(obs,expect)
a b
174 199
And you can see it worked:
> prop.table(minsamp(obs,expect))
a b
0.4664879 0.5335121
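If you want the per-group sampling rates rather than just the counts, a sketch along the same lines follows. The function name `sampling_rates` and its output columns are made up here. It takes the limiting group to be the one with the smallest obs / expect ratio, and it works for any number of groups.

```r
# Sketch: counts to keep and sampling rates per group, for n groups.
sampling_rates <- function(obs, expect) {
  stopifnot(length(obs) == length(expect), all(obs > 0), all(expect > 0))
  expect <- expect / sum(expect)   # guard against rounded targets not summing to 1
  total  <- min(obs / expect)      # largest total the scarcest group allows
  keep   <- pmin(round(expect * total), obs)  # never keep more than is available
  data.frame(group = names(obs),
             obs   = as.vector(obs),
             keep  = as.vector(keep),
             rate  = as.vector(keep / obs))
}

obs    <- c(a = 302, b = 199)
expect <- c(a = 0.466, b = 0.534)
res <- sampling_rates(obs, expect)
res
```

For the simple example this keeps all of group b and samples group a at a rate of about 0.576, as in the question. Note that with 14 groups a single scarce group can dominate: a group with 2 observations and an expected share of 1.62% caps the total at roughly 2 / 0.0162, i.e. about 123 observations at most.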
Upvotes: 1