Reputation: 85
I have a large dataset, a subset of which looks like this:
Var1 Var2
9 29_13x
14 41y
9 51_13x
4 101_13x
14 105y
14 109y
9 113_13x
9 114_13x
14 116y
14 123y
4 124_13x
14 124y
14 126y
4 134_13x
4 135_13x
4 137_13x
9 138_13x
4 139_13x
14 140y
9 142_13x
4 143_13x
My code sits inside a loop, and on each iteration I would like to sample, without replacement, a given number of Var2 values (set by the loop counter) from each of the Var1 categories. So for i=4 I'd like to get something like this:
29_13x
51_13x
113_13x
138_13x
which are all from Var1=9
41y
109y
126y
140y
from Var1=14, and
101_13x
134_13x
137_13x
139_13x
all from Var1=4.
I can't get sample() to work across more than one variable, and I can't find any other way to do this. Any suggestions would be greatly appreciated.
Reputation: 193677
Here are two options.
Using sample with by or tapply:
by(mydf$Var2, mydf$Var1, FUN=function(x) sample(x, 4))
tapply(mydf$Var2, mydf$Var1, FUN=function(x) sample(x, 4))
Here's some example output with tapply:
out <- tapply(mydf$Var2, mydf$Var1, FUN=function(x) sample(x, 4))
out
# $`4`
# [1] "101_13x" "143_13x" "124_13x" "134_13x"
#
# $`9`
# [1] "114_13x" "113_13x" "142_13x" "29_13x"
#
# $`14`
# [1] "116y" "109y" "140y" "105y"
You can also extract individual vectors by index position or by name:
out[[3]]
# [1] "116y" "109y" "140y" "105y"
out[["14"]]
# [1] "116y" "109y" "140y" "105y"
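Since the question has the sample size coming from a loop counter, the same idea can be wrapped in a loop. What follows is only a minimal sketch: it assumes the counter is called i, rebuilds mydf from the rows shown in the question so that it runs on its own, and stores each iteration in a list with the hypothetical name results. It swaps tapply for split plus lapply so that the result stays a list even when i is 1, and it indexes with sample.int rather than calling sample(x, i) directly, which sidesteps sample's special treatment of a length-one numeric x.

```r
# Rebuild the sample data from the question so the sketch is self-contained.
mydf <- data.frame(
  Var1 = c(9, 14, 9, 4, 14, 14, 9, 9, 14, 14, 4,
           14, 14, 4, 4, 4, 9, 4, 14, 9, 4),
  Var2 = c("29_13x", "41y", "51_13x", "101_13x", "105y", "109y", "113_13x",
           "114_13x", "116y", "123y", "124_13x", "124y", "126y", "134_13x",
           "135_13x", "137_13x", "138_13x", "139_13x", "140y", "142_13x",
           "143_13x"),
  stringsAsFactors = FALSE
)

results <- list()  # hypothetical container: one list element per iteration
for (i in 1:4) {
  # Split Var2 by Var1, then draw i values without replacement from each group.
  # sample.int() raises an error if a group has fewer than i values.
  results[[i]] <- lapply(split(mydf$Var2, mydf$Var1),
                         function(x) x[sample.int(length(x), i)])
}

results[[4]][["9"]]  # four values drawn from the Var1 == 9 group
```

If some groups might be smaller than i, replacing i with min(i, length(x)) inside the function would cap the draw at the group size instead of stopping with an error; whether that is acceptable depends on the analysis.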
Subsetting based on a random variable ranked by a grouping variable:
# Draw one random number per row, rank within each Var1 group, keep ranks 1 to 4
x <- rnorm(nrow(mydf))
mydf[ave(x, mydf$Var1, FUN = rank) %in% 1:4, ]
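To see why the ranking approach gives a sample without replacement, it may help to run it on a tiny data set and count the rows kept per group. The sketch below uses a made-up data frame (toy, with columns g and id) and a sample size n; none of these names come from the question. Each row gets an independent random number, so keeping the n smallest within a group amounts to picking n distinct rows of that group at random.

```r
set.seed(1)  # only to make the sketch reproducible

# Hypothetical toy data: three groups of unequal size.
toy <- data.frame(g  = rep(c("a", "b", "c"), times = c(5, 6, 7)),
                  id = seq_len(18))

n <- 3                                    # rows wanted per group
x <- rnorm(nrow(toy))                     # one random number per row
within_rank <- ave(x, toy$g, FUN = rank)  # rank of x inside each group
picked <- toy[within_rank <= n, ]         # keep the n lowest-ranked rows

table(picked$g)  # every group contributes exactly n rows
```

Unlike the tapply version, this returns rows of the original data frame, which is convenient when other columns need to come along with the sampled values.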