Reputation: 193
I have a data frame, let's call it mtcars. I want a new data frame, sample_mtcars, that is a sample of n rows of mtcars per gear. mtcars has a column gear with values 3, 4, and 5, and I would like a new data frame with a sample of n rows with gear 3, n rows with gear 4, and n rows with gear 5.
The solution should handle the case where one value of gear has fewer rows than the number to be sampled.
What is the best way to do this? Thanks!
Upvotes: 0
Views: 145
Reputation: 886938
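Using data.table:
library(data.table)
mtcars1 <- mtcars
# sample 3 rows within each gear group
setDT(mtcars1)[, .SD[sample(.N, 3)], by=gear]
#or, sample row indices (.I) per group and subset once
setDT(mtcars1)[mtcars1[, sample(.I,3), by=gear]$V1,]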
Or using aggregate, to keep the row names:
mtcars[t(with(mtcars, aggregate(1:nrow(mtcars), list(gear), FUN=sample,3))[,-1]),]
# mpg cyl disp hp drat wt qsec vs am gear carb
#Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
#Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
#Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
#Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
#Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
#Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
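The snippets above sample a fixed 3 rows per group, and sample() raises an error if a group has fewer rows than requested. A minimal sketch of one way to cover that case with data.table is to cap the sample size at the group size. The column name model and the choice of n = 10 are only for illustration (gear 5 has just 5 rows in mtcars, so it shows the cap in action).

```r
library(data.table)

n <- 10  # illustrative: larger than the smallest gear group (5 rows)

# keep.rownames stores the car names in a column (here called "model",
# a name chosen for this sketch), since data.table drops row names
dt <- as.data.table(mtcars, keep.rownames = "model")

# min(.N, n) caps the sample size at the group size, so a small group
# returns all of its rows instead of raising an error
sample_mtcars <- dt[, .SD[sample(.N, min(.N, n))], by = gear]
```

Groups with at least n rows contribute n rows each; smaller groups contribute every row they have.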
Upvotes: 2
Reputation: 193507
Maybe this is overkill for your question, but I've written a function called stratified that should work for you.
Its features are illustrated in the examples that follow.
Here's an example:
## (Or just copy and paste the function in your session)
library(devtools)
source_gist("https://gist.github.com/mrdwab/6424112")
stratified(mtcars, "gear", 3)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
# Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
# Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
# Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
# Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
# Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
# Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Here are some more examples:
If we considered "carb" and "cyl" as our grouping variables, note that some of the combinations have fewer than 3 rows of data:
table(interaction(mtcars[c("carb", "cyl")], drop = TRUE))
#
# 1.4 2.4 1.6 4.6 6.6 2.8 3.8 4.8 8.8
# 5 6 2 4 1 4 3 6 1
This is how stratified would work, along with the warning it generates:
out1 <- stratified(mtcars, c("carb", "cyl"), 3)
# Some groups
# ---1.6, 6.6, 8.8---
# contain fewer observations than desired number of samples.
# All observations have been returned from those groups.
Compare the number of rows returned per group with the group sizes shown earlier, then inspect the first few rows of the result:
table(interaction(out1[c("carb", "cyl")], drop = TRUE))
#
# 1.4 2.4 1.6 4.6 6.6 2.8 3.8 4.8 8.8
# 3 3 2 3 1 3 3 3 1
head(out1)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
# Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
# Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
# Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
# Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
You can also "subset" while sampling. For example, if you only wanted "carb" values of 1, 2, and 4, and "cyl" values of 4 and 8 to be included, you can do:
out2 <- stratified(mtcars, c("carb", "cyl"), 3,
                   select = list(carb = c(1, 2, 4),
                                 cyl = c(4, 8)))
out2
# mpg cyl disp hp drat wt qsec vs am gear carb
# Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
# Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
# Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
# Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
# Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
# Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
# Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
# Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
# Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
# Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
The size argument also accepts a value less than 1 if you wanted to take a percentage of each group. For example, setting size = .25 would sample 25% (rounded) of each group.
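To make the idea of proportional sampling concrete, here is a rough base R sketch of taking a rounded 25% from each gear group. It is not the code of stratified itself; the use of round() and the variable names prop, groups, and picked are assumptions made for this illustration.

```r
set.seed(42)  # only to make the illustration reproducible
prop <- 0.25

# row indices of mtcars, split by gear
groups <- split(seq_len(nrow(mtcars)), mtcars$gear)

# for each group, draw round(prop * group size) indices without replacement
picked <- lapply(groups, function(idx) {
  k <- round(length(idx) * prop)
  idx[sample.int(length(idx), k)]
})

sample_mtcars <- mtcars[unlist(picked), ]
```

With the gear group sizes in mtcars (15, 12, and 5 rows), this rounding rule draws 4, 3, and 1 rows respectively.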
Upvotes: 2
Reputation: 70256
Another option would be dplyr:
library(dplyr)
sample_mtcars <- mtcars %>%
  group_by(gear) %>%
  do(sample_n(., 3))
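Note that sample_n() raises an error when a group has fewer rows than requested. A hedged alternative, assuming dplyr 1.0.0 or later is installed, is slice_sample(), which is documented to truncate n silently to the group size. The value n = 10 below is only chosen so that the smallest gear group (5 rows) triggers the truncation.

```r
library(dplyr)

sample_mtcars <- mtcars %>%
  group_by(gear) %>%
  slice_sample(n = 10) %>%  # groups with fewer than 10 rows return all their rows
  ungroup()
```

As with the version above, the row names of mtcars are not carried through the grouped result.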
Upvotes: 1
Reputation: 263301
> mtcars[ unlist( tapply( rownames(mtcars), mtcars$gear, sample, 3)) , ]
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Upvotes: 3
Reputation: 27388
Here's one way, using lapply over the result of split:
set.seed(1)
n <- 3
do.call(rbind, lapply(split(mtcars, mtcars$gear), function(x)
  x[sample(nrow(x), n), ]))
# mpg cyl disp hp drat wt qsec vs am gear carb
# 3.Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
# 3.Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
# 3.Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
# 4.Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
# 4.Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
# 4.Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
# 5.Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
# 5.Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
# 5.Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
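As written, this errors if n exceeds the number of rows in any group. A small variation on the same split/lapply idea caps the sample size per group; the helper name sample_by_group and the value n = 10 are made up for this sketch.

```r
set.seed(1)

# hypothetical helper: sample up to n rows from each level of `group`
sample_by_group <- function(df, group, n) {
  pieces <- lapply(split(df, df[[group]]), function(x) {
    k <- min(nrow(x), n)                      # never ask for more rows than exist
    x[sample.int(nrow(x), k), , drop = FALSE]
  })
  do.call(rbind, pieces)
}

# gear 5 has only 5 rows, so it contributes all of them
sample_mtcars <- sample_by_group(mtcars, "gear", 10)
```

The row names keep the group prefix added by rbind, as in the output above.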
Upvotes: 2