MarcoD
MarcoD

Reputation: 137

Random sampling with different strata

I have a dataset from which I want to select a random sample of rows, but following some pre-defined rules. This may be a very basic question but I am very new to this and still trying to grasp the basic concepts. My dataset includes some 330 rows of data (I have included a simplified version here) with several columns. I want to sample 50 rows out of the 330 (I kept these numbers in the mock dataset for simplicity as this is part of the problem I am having) with the option to add the predefined rules to the sampling process. Here is a mock version of the data:

bank<-data.frame(matrix(0,nrow=330,ncol=5))
colnames(bank)<-c("id","var1","var2","year","lo")
bank$id<-c(1:330)
bank$var1<-sample(letters[1:5],330,replace=T)
bank$var2<-sample(c("s","r"),330,replace=T)
bank$var3<-sample(2010:2018,330,replace=T)
bank$lo<-sample(c("lo1","lo2","lo3","lo4","lo5","lo6"),330,replace=T)

The code I used to try to sample the correct number of rows is

library(splitstackshape)
x<-splitstackshape::stratified(indt=bank,group=c("var1","var2","year","lo"),0.151)

However this is not selecting 50 rows. I had initially tried to define size=50 but I got the following error:

Groups b s 2012 lo4,... [there is a very long list here],...contain fewer rows than requested. Returning all rows.

Then I tried to define size as a percent: 0.151 (15.1%?) which should be right 50 out of 330 but that samples 5 rows (I tried 0.5 and samples 44 rows and if I try 0.500000001 it samples 287 rows???).

What am I missing? For the moment I am stuck here.

Once I manage to sample the correct number of rows (50) I would like to define some rules, like: only upto 50% of the sample can have 2018 (bank$year) AND only up to half of the bank$year==2018 rows can have bank$var2=="r". Obviously I don't expect someone to do this for me, but could you please provide some advice on

1- Why am I getting the wrong number of rows (probably just syntax?) 2- what package I should look into if splitstackshape::stratified() is not the best or a good choice to achieve this?

Many thanks! M

Upvotes: 0

Views: 656

Answers (2)

Corey Pembleton
Corey Pembleton

Reputation: 747

Another process, by selecting the n = sample desired for each group, provided by Jenny Bryan here; sampling from groups where you specify n based on the specific sample size per group, samp is the randomized sample per n group; so n will need to be adjusted according to the proportionate amount per group:

bank %>% 
  group_by(var1) %>% 
  nest() %>% 
  mutate(n = c(7,0,9,1,13),
         samp = map2(data, n, sample_n)) %>% 
  select(var1, samp) %>% 
  unnest()

Upvotes: 0

Joseph Clark McIntyre
Joseph Clark McIntyre

Reputation: 1094

I think the issues comes from the fact that your dataset (as you've shared here) is fairly small, you have a large number of strata (5 letters X 2 s or r X 9 years X 6 lo categories), and it's just not possible to take samples of the desired size from within each stratum. When I bump the sample size up to 33,000 and take a sample of 15.1%, I get a sample of size 4,994. Putting size = 50 is requesting a sample of size 50 from each stratum, which is not remotely possible with the data you've shared.

> bank<-data.frame(matrix(0,nrow=33000,ncol=5))
> colnames(bank)<-c("id","var1","var2","year","lo")
> bank$id<-c(1:33000)
> bank$var1<-sample(letters[1:5],33000,replace=T)
> bank$var2<-sample(c("s","r"),33000,replace=T)
> bank$var3<-sample(2010:2018,33000,replace=T)
> bank$lo<-sample(c("lo1","lo2","lo3","lo4","lo5","lo6"),330,replace=T)
> 
> k <- stratified(bank, group = c('var1', 'var2', 'var3', 'lo'), size = .151)
> dim(k)
[1] 4994    6

Upvotes: 1

Related Questions