jmPicaza

Reputation: 488

Cluster sampling with R or SAS

I have 155k points distributed across 2k groups. There are 3 kinds of points (A + B + C = #points).

Frequency distribution:

Gr      #clients     #A     #B     #C
-------------------------------------
01           100     80     10     10
02            10      0      3      7
...
2000         400    300     80     20
-------------------------------------
TOTAL:    155000  93000  46500  15500

I want to select random groups of points totalling 6,000 points, such that the proportion of each type of point in the sample is the same as in the population.

Is there a method for this in R or SAS? Or should I draw a simple random sample and then design some algorithm that substitutes groups until I get a balanced sample?

Upvotes: 3

Views: 2644

Answers (2)

Turner Bond

Reputation: 11

EXAMPLE 1: THIS IS HOW I WOULD DO IT IN SAS. If code makes you nervous, use the simpler method in EXAMPLE 2, below.

Note: What you're describing sounds like a proportional sample, not a cluster sample, so that's what I've shown here. Hope that meets your needs.

    /******** sort by strata (and by client within each stratum) *****/
    proc sort data=MED_pts_155k ; by GRoup A_B_C clients ; run ;

    /******** create sample design ***/
    proc surveyselect noprint
      data = MED_pts_155k
      method = srs
      seed = 7
      n = 6000
      out = sample_design ;
      strata GRoup A_B_C /
        alloc = prop nosample
        allocmin = 2 ; /*** min of 2 per stratum ***/
    run ;

    /******** pull sample **********/
    proc surveyselect noprint
      data = MED_pts_155k
      method = sys
      seed = 7
      n = sample_design /*** per-stratum sizes from the design step ***/
      out = MY_SAMPLE ;
      strata GRoup A_B_C ;
    run ;

The "alloc = prop" option gives you proportional allocation: each stratum is sampled in proportion to its share of the population. The "nosample" option tells SAS to output only the sample design (the allocated sample size per stratum) without drawing a sample. You then feed that design into a second step that actually pulls the sample. If this is too much bother, you can leave off the "nosample" option and go straight to pulling your sample, as in the simpler example below.
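
To make the arithmetic behind proportional allocation concrete, here is a small R sketch. It uses only the three point types and the TOTAL row from the question as strata (an assumption for illustration; the SAS code above stratifies by group and type together), and the variable names are made up.

```r
# Population counts by point type, taken from the TOTAL row of the question
pop <- c(A = 93000, B = 46500, C = 15500)
n   <- 6000  # desired total sample size

# Proportional allocation: each stratum gets n * (stratum size / population size)
alloc <- round(n * pop / sum(pop))
print(alloc)
```

With many small strata the rounded allocations will not always sum exactly to n, so a real design usually needs a rule for distributing the remainder.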

Note that in the second step above we've chosen to switch to 'method = SYS' instead of a simple random sample (SRS). SRS would work too, but since you may have different types of responses by client, you might want to sample in a representative way across the range of clients. To do that, you sort within each stratum by client and deliberately sample at even intervals across the range of clients; this is called a "systematic" sample (SYS).
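
For readers who want to see the idea of systematic selection outside SAS, here is a generic R sketch for a single stratum. It is not a reproduction of what PROC SURVEYSELECT does internally; `systematic_sample` is a hypothetical helper and the client ids are made up.

```r
# Systematic sample: sort the units, then step through them at a fixed
# interval from a random start. `systematic_sample` is a hypothetical helper.
systematic_sample <- function(sorted_ids, n) {
  N <- length(sorted_ids)
  stopifnot(n >= 1, n <= N)
  k <- N / n                              # sampling interval (may be fractional)
  start <- runif(1, min = 0, max = k)     # random start inside the first interval
  idx <- floor(start + k * (0:(n - 1))) + 1
  sorted_ids[pmin(idx, N)]                # pmin guards against a floating-point edge case
}

set.seed(7)
clients <- sort(sample(1:1000, 100))      # made-up client ids within one stratum
picked  <- systematic_sample(clients, 10)
print(picked)
```

Because the interval is at least 1, the selected positions are all distinct and spread evenly over the sorted list.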

EXAMPLE 2: SIMPLER

You could also do it all in one simple step if you want less code, and don't need to see the sample design broken down in a separate file.

/******** sort by strata *****/
proc sort data=MED_pts_155k ; by GRoup A_B_C ; run ;

/******** pull sample **********/
proc surveyselect noprint
  data= MED_pts_155k
  method= SRS
  seed = 7 
  n = 6000
  out = MY_SAMPLE ;
 strata GRoup A_B_C  / 
    alloc=prop 
    allocmin = 2  ; 
run ;

In both examples we're assuming you have two stratification variables: 'GRoup' and a second variable 'A_B_C', which contains the values a, b, or c. Hope that helps. Cluster sampling is possible in SAS as well, but as noted above, I've illustrated a proportional sample here since that seems to be what you need. Cluster sampling would take a little more space to describe.
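
For comparison, a bare-bones cluster-style selection might look like the R sketch below: whole groups are drawn in random order until roughly 6,000 points are covered. The group sizes are made up, and nothing here balances the A/B/C proportions, so treat it as a starting point under those assumptions rather than a finished design.

```r
set.seed(7)
# Made-up group sizes for 2000 groups (assumption; not the asker's data)
groups <- data.frame(
  group = 1:2000,
  size  = sample(10:145, 2000, replace = TRUE)
)
target <- 6000

# Shuffle the groups, then keep whole groups until the running total reaches the target
shuffled <- groups[sample(nrow(groups)), ]
shuffled$cum_size <- cumsum(shuffled$size)
n_keep <- which(shuffled$cum_size >= target)[1]
chosen <- shuffled[seq_len(n_keep), ]

# Total points in the selected clusters: at or slightly above the target
print(sum(chosen$size))
```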

Upvotes: 1

Anthony Damico

Reputation: 6104

i don't understand your fake data so i'll make my own.

i'm assuming you construct your own unique groups. i've just used the numbers 1:2000 to do it, but you can run this code on any group type.

# let's make some fake data with 155k points distributed in 2k groups
x <- 
    data.frame(
        groupname = sample( x = 1:2000 , size = 155000 , replace = TRUE ) ,
        anothercol = 1 ,
        andanothercol = "hi"
    )

# look at your data frame `x`
head( x )
# so long as you've constructed a `groupname` variable in your data, it's easy

# calculate the proportion of each group in the total
groupwise.prob <- table( x$groupname ) / nrow( x )
# store that into a probability vector

# convert this to a data frame
prob.frame <- data.frame( groupwise.prob )

head( prob.frame )

# rename the `Var1` column to match your group name variable on `x`
names( prob.frame )[ 1 ] <- 'groupname'

# rename the `Freq` column to say what it is on `x`
names( prob.frame )[ 2 ] <- 'prob'

# merge these individual probabilities back onto your data frame
x <- merge( x , prob.frame , all.x = TRUE )

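# now just use the sample function's prob= parameter off of that
# and scale down the size to what you want
# note: prob= weights each record by its group's share, so records in
# larger groups are more likely to be drawn; leaving prob= out gives
# every record an equal chance instead
recs.to.samp <-
    sample( 
        1:nrow( x ) , 
        size = 6000 , 
        replace = FALSE , 
        prob = x$prob 
    )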

# and now here's your new sample, with proportions intact
y <- x[ recs.to.samp , ]

head( y )
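
if the goal is to match the A/B/C split from the question, one option is to draw separately within each type and then compare shares. this is only a sketch on made-up data: the `type` column, the `pts` data frame, and the 60/30/10 split used to generate it are assumptions, not the asker's data.

```r
set.seed(7)
# made-up data: 155k points, each with a type drawn to roughly match the
# 60/30/10 split in the question (`type` is a hypothetical column)
pts <- data.frame(
  id   = 1:155000,
  type = sample(c("A", "B", "C"), 155000, replace = TRUE, prob = c(0.6, 0.3, 0.1))
)
n <- 6000

# proportional allocation per type, then a simple random draw within each type
counts <- vapply(split(pts$id, pts$type), length, integer(1))
alloc  <- round(n * counts / nrow(pts))
rows <- unlist(lapply(names(alloc), function(ty) {
  sample(which(pts$type == ty), size = alloc[[ty]])
}))
smp <- pts[rows, ]

# compare type shares in the population and in the sample
shares <- rbind(
  population = prop.table(table(pts$type)),
  sample     = prop.table(table(smp$type))
)
print(round(shares, 3))
```

because of rounding, the sample can end up a record or so off from 6000.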

Upvotes: 0
