Re2124

Reputation: 21

Resample from data keeping factor distribution of specific variables

I would like to resample my data with replacement while keeping the proportions of my two variables (I and O) constant, i.e. the same number of 1s and 0s in the resampled sample as in the original. This is my data:

dat[,c(2,4,7)]
   I O SIDI.F
1  0 0     50
2  1 0     13
3  1 0     13
4  0 1     12
5  0 0     13
6  0 0     15
7  0 1     23
8  0 1     34

Since I could not find a way to do this, I simplified the problem and split the data set, trying to at least keep the proportions constant for either O or I:

dat3
   O SIDI.F
1  0     50
2  0     13
3  0     13
4  1     12
5  0     13

dat2
   I SIDI.F
1  0     50
2  1     13
3  1     13
4  0     12
5  0     13

datBoot2 <- dat2[sample(seq_len(nrow(dat2)), 8, replace = TRUE), ]
datBoot3 <- dat3[sample(seq_len(nrow(dat3)), 8, replace = TRUE), ]

However, I still can't find a way to keep the proportions (the same number of 1s and 0s in the resampled dataset). Can anyone help, please?

Upvotes: 1

Views: 261

Answers (2)

Re2124

Reputation: 21

Thank you all for your answers! I seem to have found a solution. I found some code in a different post on Stack Overflow and changed it to sample with replacement. Although I do not fully understand the code, it seems to work:

sampFreq <- function(cdf, col, ns) {
  # Stratified resampling with replacement: each level of `col` contributes
  # rows in proportion to its share of the original data.
  nr <- nrow(cdf)
  sLevels <- levels(as.factor(cdf[, col]))
  rat <- ns / nr                          # scaling from original size to new size
  rdata <- NULL
  for (lev in sLevels) {
    ldata <- cdf[cdf[, col] == lev, , drop = FALSE]   # rows belonging to this level
    ndata <- nrow(ldata)
    # Rows to draw for this level (at least one). Because of rounding,
    # the total may differ slightly from ns.
    nsdata <- max(round(ndata * rat), 1)
    srows <- sample(seq_len(ndata), nsdata, replace = TRUE)
    rdata <- rbind(rdata, ldata[srows, , drop = FALSE])
  }
  rdata
}

datSample <- sampFreq(dat,"I",19)

Checking the proportion of the new sample via the following code seems to indicate the correct proportion.

freq_x<-table(datSample$I)
freq_x/sum(freq_x)
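The function above stratifies on one column only. Since the question asks about I and O together, here is a rough sketch of one way to resample within each observed combination of the two. The helper name resample_within is made up for illustration, the data frame is retyped from the question, and the sample size is assumed to stay equal to the original number of rows:

```r
# Example data copied from the question (columns I, O, SIDI.F).
dat <- data.frame(
  I      = c(0, 1, 1, 0, 0, 0, 0, 0),
  O      = c(0, 0, 0, 1, 0, 0, 1, 1),
  SIDI.F = c(50, 13, 13, 12, 13, 15, 23, 34)
)

# Hypothetical helper: bootstrap within each observed combination of `cols`,
# drawing exactly as many rows as that combination has in the original data.
resample_within <- function(df, cols) {
  strata <- interaction(df[cols], drop = TRUE)  # one level per observed combination
  groups <- split(seq_len(nrow(df)), strata)    # row indices per stratum
  idx <- unlist(lapply(groups, function(rows) {
    # sample.int avoids the sample(x) pitfall when a stratum has a single row
    rows[sample.int(length(rows), length(rows), replace = TRUE)]
  }), use.names = FALSE)
  df[idx, , drop = FALSE]
}

set.seed(1)  # only to make the example reproducible
datBoot <- resample_within(dat, c("I", "O"))

# Joint counts of I and O match the original by construction.
table(datBoot$I, datBoot$O)
```

Because every stratum keeps its original size, both the marginal and the joint counts of I and O are preserved, while the rows drawn inside each stratum still vary from run to run.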

Upvotes: 0

Wimpel

Reputation: 27732

Sampling should (?) involve some kind of randomness... I believe the rbinom() function can be used here. The probability of success (x == 1), passed as the prob argument, is calculated from the original input.

mysample <- function(x) rbinom(length(x), 1, sum(x == 1)/length(x))
mysample(dat$O)
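One thing to keep in mind: binomial draws match the original proportion only on average, so the number of 1s changes from run to run. A small sketch to illustrate, with the O values retyped from the question; the within-group index resampling at the end is my own assumption about how one could keep the counts exact, not part of the answer above:

```r
# Values of O copied from the question.
O <- c(0, 0, 0, 1, 0, 0, 1, 1)

# Binomial draws as in the answer above.
mysample <- function(x) rbinom(length(x), 1, mean(x == 1))

set.seed(42)  # only to make the example reproducible
draws <- replicate(1000, sum(mysample(O)))
table(draws)  # distribution of the number of 1s across 1000 resamples

# Alternative (assumption): resample row indices within each level of O,
# which keeps the number of 0s and 1s exactly as in the original.
idx <- unlist(lapply(split(seq_along(O), O), function(i) {
  i[sample.int(length(i), replace = TRUE)]
}), use.names = FALSE)
table(O[idx])
```

The indices in idx can be used to subset the full data frame, so the other columns are resampled along with O.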

Upvotes: 1
