Using sample() or an equivalent on 2 variables of a data frame

Question

I have a large dataset, a subset of which looks like this:

Var1    Var2
9     29_13x
14    41y
9     51_13x
4     101_13x
14    105y
14    109y
9     113_13x
9     114_13x
14    116y
14    123y
4     124_13x
14    124y
14    126y
4     134_13x
4     135_13x
4     137_13x
9     138_13x
4     139_13x
14    140y
9     142_13x
4     143_13x

My code sits inside a loop, and on each iteration I would like to sample, without replacement, a certain number of Var2 values (the number is set by the loop iteration) from each of the Var1 categories. So for i=4 I'd like to get something like this:

29_13x
51_13x
113_13x
138_13x

which are all from Var1=9

41y
109y
126y
140y

from Var1=14, and

101_13x
134_13x
137_13x
139_13x

all from Var1=4.

I can't get sample() to work across more than one variable and can't find any other way to do this. Any suggestions would be greatly appreciated.

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer

Here are two options.

Using sample with by or tapply:

by(mydf$Var2, mydf$Var1, FUN=function(x) sample(x, 4))
tapply(mydf$Var2, mydf$Var1, FUN=function(x) sample(x, 4))

Here's some example output with tapply:

out <- tapply(mydf$Var2, mydf$Var1, FUN=function(x) sample(x, 4))
out
# $`4`
# [1] "101_13x" "143_13x" "124_13x" "134_13x"
# 
# $`9`
# [1] "114_13x" "113_13x" "142_13x" "29_13x" 
# 
# $`14`
# [1] "116y" "109y" "140y" "105y"

You can also extract individual vectors by index position or by name:

out[[3]]
# [1] "116y" "109y" "140y" "105y"

out[["14"]]
# [1] "116y" "109y" "140y" "105y"
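
Since the question asks for the sample size to follow the loop iteration, the tapply approach above could be wrapped roughly as follows. This is only a sketch: mydf is rebuilt here from the data shown in the question, and sample_per_group is a hypothetical helper name, not something from the original answer. The min() cap is an added assumption so that a group with fewer than n values returns everything it has instead of raising an error.

```r
# Rebuild the question's sample data (Var2 kept as character, not factor)
mydf <- data.frame(
  Var1 = c(9, 14, 9, 4, 14, 14, 9, 9, 14, 14, 4,
           14, 14, 4, 4, 4, 9, 4, 14, 9, 4),
  Var2 = c("29_13x", "41y", "51_13x", "101_13x", "105y", "109y", "113_13x",
           "114_13x", "116y", "123y", "124_13x", "124y", "126y", "134_13x",
           "135_13x", "137_13x", "138_13x", "139_13x", "140y", "142_13x",
           "143_13x"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: draw up to n values per group, without replacement.
# Indexing with sample.int() avoids sample()'s special case for a single
# numeric value; simplify = FALSE keeps the result a list even when n is 1.
sample_per_group <- function(values, groups, n) {
  tapply(values, groups, FUN = function(x) {
    x[sample.int(length(x), min(n, length(x)))]
  }, simplify = FALSE)
}

set.seed(1)  # only so the sketch is reproducible
for (i in 2:4) {
  out <- sample_per_group(mydf$Var2, mydf$Var1, i)
  print(sapply(out, length))  # number of values drawn per Var1 group
}
```

Each out is indexed the same way as before, for example out[["14"]].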

Subsetting on a random variable ranked within each level of the grouping variable (this returns the sampled rows of mydf rather than a list of vectors):

x <- rnorm(nrow(mydf))
mydf[ave(x, mydf$Var1, FUN = rank) %in% 1:4, ]
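
To make the ranking approach easier to follow, here is the same idea split into steps. It is a sketch under a few assumptions: mydf is rebuilt from the question's data, n stands in for the loop variable, and the names r and picked are introduced only for illustration. Each row gets a random score, ave() ranks the scores within each Var1 group, and the rows ranked 1 to n are kept. Because the scores are random, this amounts to sampling n rows per group without replacement.

```r
# Rebuild the question's sample data
mydf <- data.frame(
  Var1 = c(9, 14, 9, 4, 14, 14, 9, 9, 14, 14, 4,
           14, 14, 4, 4, 4, 9, 4, 14, 9, 4),
  Var2 = c("29_13x", "41y", "51_13x", "101_13x", "105y", "109y", "113_13x",
           "114_13x", "116y", "123y", "124_13x", "124y", "126y", "134_13x",
           "135_13x", "137_13x", "138_13x", "139_13x", "140y", "142_13x",
           "143_13x"),
  stringsAsFactors = FALSE
)

set.seed(42)  # only so the sketch is reproducible
n <- 4        # rows wanted per group (the loop variable in the question)

x <- rnorm(nrow(mydf))                 # one random score per row
r <- ave(x, mydf$Var1, FUN = rank)     # rank of each score within its Var1 group
picked <- mydf[r %in% seq_len(n), ]    # keep the rows ranked 1..n in each group

picked[order(picked$Var1), ]           # show the sampled rows grouped by Var1
```

A group with fewer than n rows is returned in full, since all of its ranks fall within 1..n.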
