AgiZet

Reputation: 1

Stratified sampling in R - more directions needed

To reduce the dataset, I have been advised to use "stratified sampling".

Because I'm very new to R programming, I find the current articles on Stack Overflow hard to follow; they give very little explanation.

I have a data set of over 60,000 observations and 24 variables. Of these variables, 21 are quantitative (numeric).

How do I draw a sample from that data? Also, where do I specify the dataset name? Do I need to name the new "reduced" dataset so that I can use it in further analysis?


ADDED CODE (this is what I used for sampling):

# Sample a percentage of values from each stratum (10% in this case)
DB.quant.sample = lapply(split(DB.quant, DB.quant$group_size), function(DB.quant) {
  DB.quant[sample(1:nrow(DB.quant), ceiling(nrow(DB.quant) * 0.1)), ]
})

Browse[1]>DB.quant[sample(1:nrow(DB.quant), 6000), ]

# DB.quant is the dataset and group_size is one of the variables. I'm not sure which variable I should use.

I'm having trouble illustrating graphically and intuitively how a clustering algorithm works. I started with:

DB <- na.omit(DB)
DB.quant <- DB[c(2,3,4,6,7,8,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24)]

And then:

d <- dist(DB.quant.sample) # but I'm getting an error:
Error in dist(DB.quant.sample, method = "euclidean") : (list) object cannot be coerced to type 'double'

[Image: first few rows of my data set]

Upvotes: 0

Views: 433

Answers (1)

eipi10

Reputation: 93761

I'm not sure exactly how you want to sample, but here's a simple example using the built-in iris data frame. Below are two ways to do it: one uses base R and the other uses the dplyr package.

Base R

  1. Split the data frame into three separate smaller data frames, one for each Species.
  2. Randomly sample 5 rows per Species.

    # Sample 5 rows from each stratum
    df.sample = lapply(split(iris, iris$Species), function(df) {
      df[sample(1:nrow(df), 5), ]
    })
    
    # Sample a percentage of values from each stratum (10% in this case)
    df.sample = lapply(split(iris, iris$Species), function(df) {
      df[sample(1:nrow(df), ceiling(nrow(df) * 0.1)), ]
    })
    

    This gives us a list containing three data frames, one for each unique value of Species. It is still a list at this point, not a single data frame.

  3. Combine the three samples into a single data frame.

    df.sample = do.call(rbind, df.sample)
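
Applied to the situation in the question, the same split, sample, and combine steps might look like the sketch below. Since the real DB.quant isn't available, `toy`, its columns `x` and `y`, and the stratum sizes are made-up stand-ins; only `group_size` is taken from the question. The point it tries to show is that `dist()` needs a single numeric data frame (or matrix), so passing it the list returned by `lapply()` is the likely cause of the "(list) object cannot be coerced to type 'double'" error.

```r
set.seed(42)  # make the random sample reproducible

# Hypothetical stand-in for DB.quant: a stratum column plus two numeric columns
toy <- data.frame(
  group_size = rep(1:3, times = c(50, 30, 20)),
  x = rnorm(100),
  y = rnorm(100)
)

# Sample 10% of the rows within each stratum; the result is a list of data frames
toy.list <- lapply(split(toy, toy$group_size), function(df) {
  df[sample(seq_len(nrow(df)), ceiling(nrow(df) * 0.1)), ]
})

# Combine the list into one data frame before handing it to dist()
toy.sample <- do.call(rbind, toy.list)

# Distances are computed on the numeric columns of the combined sample
d <- dist(toy.sample[, c("x", "y")], method = "euclidean")
```

The reduced data set is simply whatever name you assign the combined result to (`toy.sample` here), and you use that name in any later analysis.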
    

dplyr package

Do the grouping and sampling in a single chain of functions using the pipe (%>%) operator:

library(dplyr)

# Sample 5 values from each stratum
df.sample = iris %>% 
  group_by(Species) %>% 
  sample_n(5)

# Sample a percentage of values from each stratum (10% in this case)
df.sample = iris %>% 
  group_by(Species) %>% 
  sample_frac(0.1)
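
If you have dplyr 1.0.0 or later, `sample_n()` and `sample_frac()` still work but are superseded by `slice_sample()`. A rough equivalent of the two calls above, assuming that dplyr version is installed, could be written as follows; the name `df.sample.frac` and the trailing `ungroup()` are additions for illustration and are optional.

```r
library(dplyr)

# Sample 5 rows from each stratum
df.sample <- iris %>%
  group_by(Species) %>%
  slice_sample(n = 5) %>%
  ungroup()

# Sample a proportion of rows from each stratum (10% in this case)
df.sample.frac <- iris %>%
  group_by(Species) %>%
  slice_sample(prop = 0.1) %>%
  ungroup()
```

Either way, the result is already a single data frame (a tibble), so no separate combine step is needed.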

Upvotes: 2
