Reputation: 488
I have 155k points distributed across 2k groups. There are 3 kinds of points (A + B + C = #points).
Frequency distribution:
Gr       #clients      #A      #B      #C
-----------------------------------------
01            100      80      10      10
02             10       0       3       7
...
2000          400     300      80      20
-----------------------------------------
TOTAL:     155000   93000   46500   15500
I want to select random groups of points, up to a total of 6,000 points, such that the proportion of each type of point in the sample is the same as in the population.
Is there a method for this in R or SAS? Or should I draw a simple random sample and then design some algorithm of group substitution until I get a balanced sample?
Upvotes: 3
Views: 2644
Reputation: 11
Note: What you're describing sounds like a proportional sample, not a cluster sample, so that's what I've shown here. Hope that meets your needs.
/******** sort by strata *****/
proc sort data=MED_pts_155k ; by GRoup A_B_C clients ; run ;
/******** create sample design ***/
proc surveyselect noprint
    data = MED_pts_155k
    method = srs
    seed = 7
    n = 6000
    out = sample_design ;
    strata GRoup A_B_C /
        alloc = prop NOSAMPLE
        allocmin = 2 ; /*** min of 2 per stratum. ****/
run ;
/******** pull sample **********/
proc surveyselect noprint
    data = MED_pts_155k
    method = sys
    seed = 7
    n = sample_design /*** per-stratum sample sizes from the design step ***/
    out = MY_SAMPLE ;
    strata GRoup A_B_C ;
run ;
The "alloc = prop" option gives you proportional (i.e. 'even') allocation: each stratum's share of the sample matches its share of the population. The "nosample" option tells SAS to skip the selection itself and write a separate file outlining the sample design. You then use that design in a second stage where you actually pull the sample. If this is too much bother, you can leave off the "nosample" option and go straight to pulling your sample, as we do in the simpler example below.
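To make the proportional allocation concrete, here is a minimal R sketch of the arithmetic, using only the TOTAL row from the question. It allocates by point type alone, which is a simplification: the SAS code above allocates per GRoup x A_B_C stratum. The names `pop`, `n` and `alloc` are just illustrative.

```r
# Population counts by point type, taken from the TOTAL row of the question
pop <- c(A = 93000, B = 46500, C = 15500)
n <- 6000  # target sample size

# Proportional allocation: each type's share of the sample equals
# its share of the population (n_h = n * N_h / N)
alloc <- n * pop / sum(pop)
print(alloc)
```

With these totals the shares are 60%, 30% and 10%, so the 6,000 points split into 3,600 A, 1,800 B and 600 C.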
Note that in the second step above we've chosen to switch to 'method = SYS' instead of a simple random sample (SRS). SRS would work too, but since you may have different types of responses by client, you might want to sample in a representative way across the range of clients. To do that, you sort within each stratum by client and intentionally sample at even increments across the range of clients; this is called a "systematic" sample (SYS).
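For readers more at home in R, the idea of systematic sampling within one stratum can be sketched roughly as follows. This is one common variant (random start, fixed and possibly fractional interval), not necessarily the exact algorithm SAS uses, and `systematic_sample` is a hypothetical helper written for this illustration.

```r
# Hypothetical helper: systematic sample of k items from a vector `ids`
# that is already sorted in the order you want to spread the sample across
systematic_sample <- function(ids, k) {
  N <- length(ids)
  stopifnot(k >= 1, k <= N)
  step <- N / k                           # sampling interval (may be fractional)
  start <- runif(1, min = 0, max = step)  # random start inside the first interval
  idx <- floor(start + step * (0:(k - 1))) + 1
  idx <- pmin(idx, N)                     # guard against floating-point overshoot
  ids[idx]
}

set.seed(7)
clients <- sort(sample(1:1000, 100))  # fake sorted client ids for one stratum
picked <- systematic_sample(clients, 10)
print(picked)
```

Because the picks are spaced evenly along the sorted list, the sample covers the whole range of clients instead of possibly clustering at one end, which can happen by chance with SRS.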
You could also do it all in one simple step if you want less code, and don't need to see the sample design broken down in a separate file.
/******** sort by strata *****/
proc sort data=MED_pts_155k ; by GRoup A_B_C ; run ;
/******** pull sample **********/
proc surveyselect noprint
    data = MED_pts_155k
    method = SRS
    seed = 7
    n = 6000
    out = MY_SAMPLE ;
    strata GRoup A_B_C /
        alloc = prop
        allocmin = 2 ;
run ;
In both examples we're assuming you have two stratification variables: 'GRoup' and a second variable 'A_B_C', which contains values of a, b, or c. Hope that helps. Cluster sampling is possible in SAS as well, but as noted above, I've illustrated a proportional sample here since that seems to be what you need. Cluster sampling would take a little more space to describe.
Upvotes: 1
Reputation: 6104
i don't understand your fake data so i'll make my own.
i'm assuming you construct your own unique groups. i've just used the numbers 1:2000
to do it, but you can run this code on any group type..
# let's make some fake data with 155k points distributed in 2k groups
x <-
    data.frame(
        groupname = sample( x = 1:2000 , size = 155000 , replace = TRUE ) ,
        anothercol = 1 ,
        andanothercol = "hi"
    )
# look at your data frame `x`
head( x )
# so long as you've constructed a `groupname` variable in your data, it's easy
# calculate the proportion of each group in the total
# and store that into a probability vector
groupwise.prob <- table( x$groupname ) / nrow( x )
# convert this to a data frame
prob.frame <- data.frame( groupwise.prob )
head( prob.frame )
# rename the `Var1` column to match your group name variable on `x`
names( prob.frame )[ 1 ] <- 'groupname'
# rename the `Freq` column to say what it is on `x`
names( prob.frame )[ 2 ] <- 'prob'
# merge these individual probabilities back onto your data frame
x <- merge( x , prob.frame , all.x = TRUE )
# now just use the sample function's prob= parameter off of that
# and scale down the size to what you want
recs.to.samp <-
    sample(
        1:nrow( x ) ,
        size = 6000 ,
        replace = FALSE ,
        prob = x$prob
    )
# and now here's your new sample. note that prob= weights each record by
# its group's share, so records in larger groups are drawn somewhat more often
y <- x[ recs.to.samp , ]
head( y )
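If the goal is to match the A/B/C shares from the question, one possible alternative is to sample within each type separately. The sketch below assumes one row per point and a `type` column. That column, the data frame `pts` and the 60/30/10 split (taken from the question's TOTAL row) are all fake data made up for illustration. It samples individual points, not whole groups.

```r
set.seed(7)
# fake data: one row per point, with a group id and a point type
pts <- data.frame(
  groupname = sample(1:2000, size = 155000, replace = TRUE),
  type = sample(c("A", "B", "C"), size = 155000, replace = TRUE,
                prob = c(0.6, 0.3, 0.1))
)

n <- 6000
# proportional allocation per type, rounded to whole points
# (after rounding, the total can differ from n by a point or two)
counts <- table(pts$type)
alloc <- round(n * as.vector(counts) / nrow(pts))
names(alloc) <- names(counts)

# simple random sample of the allocated size within each type
rows <- unlist(lapply(names(alloc), function(ty) {
  idx <- which(pts$type == ty)
  idx[sample.int(length(idx), alloc[[ty]])]
}))
strat.sample <- pts[rows, ]

# compare type shares in the population and in the sample
pop.share <- prop.table(table(pts$type))
samp.share <- prop.table(table(strat.sample$type))
print(round(pop.share, 3))
print(round(samp.share, 3))
```

Because the number drawn from each type is fixed in advance, the sample shares match the population shares up to rounding, instead of only on average.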
Upvotes: 0