user2374133

Reputation: 193

R: how to sample a data frame by one column?

I have a dataframe, let's call it mtcars. I want a new dataframe sample_mtcars that is a sample of n rows of mtcars PER gear. mtcars has column gear with values 3,4,5 and I would like a new dataframe with a sample of n rows with gear 3, n rows with 4, and n rows with gear 5.

The solution should handle the case where one value of gear has fewer rows than the number to be sampled.

What is the best way to do this? Thanks!

Upvotes: 0

Views: 145

Answers (5)

akrun

Reputation: 886938

Using data.table

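 library(data.table)
 mtcars1 <- mtcars
 # sample 3 rows within each gear group (note: setDT() drops the row names)
 setDT(mtcars1)[, .SD[sample(.N, 3)], by=gear]
 # or: sample row indices per group, then subset once
 setDT(mtcars1)[mtcars1[, sample(.I,3), by=gear]$V1,]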
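
If a gear group could have fewer than n rows, `sample(.N, 3)` stops with an error, because it cannot draw more rows than exist without replacement. Here is a minimal sketch of one way around that, capping the sample size at the group size. The names `n`, `dt` and the `model` column are only illustrative.

```r
library(data.table)

n <- 10                                   # gear == 5 has only 5 rows in mtcars
dt <- as.data.table(mtcars, keep.rownames = "model")  # keep car names as a column

set.seed(42)
# min(.N, n) caps the request at the group size, so small groups come back whole
sample_mtcars <- dt[, .SD[sample(.N, min(.N, n))], by = gear]

sample_mtcars[, .N, by = gear]            # rows actually drawn per gear
```

With n = 10 this should give 10 rows for gears 3 and 4 and all 5 rows for gear 5.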

Or using aggregate to keep the rownames

  mtcars[t(with(mtcars, aggregate(1:nrow(mtcars), list(gear), FUN=sample,3))[,-1]),]
  #               mpg cyl  disp  hp drat    wt  qsec vs am gear carb
  #Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
  #Merc 450SE    16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
  #Merc 450SL    17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
  #Merc 240D     24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
  #Mazda RX4     21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
  #Fiat 128      32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
  #Ferrari Dino  19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
  #Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
  #Maserati Bora 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

Upvotes: 2

A5C1D2H2I1M1N2O1R2T1

Reputation: 193507

Maybe this is overkill for your question, but I've written a function called stratified that should work for you.

The features include:

  • Allowing the user to specify multiple grouping variables.
  • Sampling a fixed number of rows from each group.
  • Allowing the user to specify which subsets from the grouping variables should be considered when sampling.

Here's an example:

## (Or just copy and paste the function in your session)
library(devtools)
source_gist("https://gist.github.com/mrdwab/6424112")

stratified(mtcars, "gear", 3)
#                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
# Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
# Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
# Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
# Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
# Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
# Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
# Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
# Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2

Here are some more examples:

If we considered "carb" and "cyl" as our grouping variables, note that some of the combinations have fewer than 3 rows of data:

table(interaction(mtcars[c("carb", "cyl")], drop = TRUE))
# 
# 1.4 2.4 1.6 4.6 6.6 2.8 3.8 4.8 8.8 
#   5   6   2   4   1   4   3   6   1 

This is how stratified would work, along with the warning it generates:

out1 <- stratified(mtcars, c("carb", "cyl"), 3)
# Some groups
# ---1.6, 6.6, 8.8---
# contain fewer observations than desired number of samples.
# All observations have been returned from those groups.

Compare the group sizes in the result with the table above (groups 1.6, 6.6 and 8.8 are returned in full), then inspect the first few rows:

table(interaction(out1[c("carb", "cyl")], drop = TRUE))
# 
# 1.4 2.4 1.6 4.6 6.6 2.8 3.8 4.8 8.8 
#   3   3   2   3   1   3   3   3   1 
head(out1)
#                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
# Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
# Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
# Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
# Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
# Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2

You can also "subset" while sampling. For example, if you only wanted "carb" values of 1, 2, and 4, and "cyl" values of 4 and 8 to be included, you can do:

out2 <- stratified(mtcars, c("carb", "cyl"), 3, 
                   select = list(carb = c(1, 2, 4), 
                                 cyl = c(4, 8)))
out2
#                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
# Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
# Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
# Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
# Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
# Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
# AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
# Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
# Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
# Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
# Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
# Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4

The size argument also accepts a value less than 1 if you wanted to take a percentage of each group. For example, setting size = .25 would sample 25% (rounded) of each group.
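
If you only need the proportional case and would rather not source the gist, a rough base R sketch of the same idea could look like this. It is not the code of stratified and may round differently. `sample_frac_by_group` is a hypothetical helper name used only for this example.

```r
# Hypothetical helper: sample a fraction of the rows within each group.
sample_frac_by_group <- function(df, group, frac) {
  pieces <- lapply(split(df, df[[group]]), function(x) {
    k <- round(nrow(x) * frac)            # rows to draw from this group
    x[sample.int(nrow(x), k), , drop = FALSE]
  })
  do.call(rbind, unname(pieces))          # unname() keeps the original row names
}

set.seed(1)
out_frac <- sample_frac_by_group(mtcars, "gear", 0.25)
table(out_frac$gear)
# gear 3 has 15 rows -> 4 sampled, gear 4 has 12 -> 3, gear 5 has 5 -> 1
```

Note that small groups can round down to zero rows with this approach, so check the counts if every group must be represented.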

Upvotes: 2

talat

Reputation: 70256

Another option would be dplyr:

library(dplyr)
sample_mtcars <- mtcars %>%
    group_by(gear) %>%
    do(sample_n(., 3))
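
With a more recent dplyr (1.0.0 or later, which is an assumption about your setup), `slice_sample()` works on grouped data directly. When a group has fewer rows than requested, it returns the whole group. A sketch, where `n_per_group` and the `model` column are names made up for the example:

```r
library(dplyr)

n_per_group <- 10   # deliberately larger than the 5 rows with gear == 5

set.seed(42)
sample_mtcars <- mtcars %>%
  tibble::rownames_to_column("model") %>%  # row names are otherwise dropped
  group_by(gear) %>%
  slice_sample(n = n_per_group) %>%        # truncates to the group size if needed
  ungroup()

count(sample_mtcars, gear)                 # rows actually drawn per gear
```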

Upvotes: 1

IRTFM

Reputation: 263301

> mtcars[ unlist( tapply( rownames(mtcars), mtcars$gear, sample, 3)) , ]

                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Hornet Sportabout  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
AMC Javelin        15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Merc 280           19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Mazda RX4          21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Volvo 142E         21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
Porsche 914-2      26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Maserati Bora      15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Lotus Europa       30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
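
The call above samples row names, which sidesteps a known quirk of `sample()`: given a single number `x`, it samples from `1:x` rather than returning `x`. If you prefer to work with row positions, and also want groups smaller than the requested size to be returned whole, a sketch along these lines should do it (`n` and `idx` are illustrative names):

```r
n <- 10   # larger than the 5 rows with gear == 5

set.seed(42)
idx <- unlist(tapply(seq_len(nrow(mtcars)), mtcars$gear, function(i) {
  # sample positions within the group, then map back to row numbers;
  # this stays correct even when the group has a single row
  i[sample.int(length(i), min(length(i), n))]
}))

sample_mtcars <- mtcars[idx, ]
table(sample_mtcars$gear)
```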

Upvotes: 3

jbaums

Reputation: 27388

Here's one way, using lapply over the result of split:

set.seed(1)
n <- 3
do.call(rbind, lapply(split(mtcars, mtcars$gear), function(x) 
  x[sample(nrow(x), n), ]))

#                       mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# 3.Duster 360         14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
# 3.Merc 450SL         17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
# 3.Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
# 4.Fiat X1-9          27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
# 4.Datsun 710         22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
# 4.Honda Civic        30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
# 5.Maserati Bora      15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
# 5.Ford Pantera L     15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
# 5.Lotus Europa       30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
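
As written, this fails when a group has fewer than n rows. One possible tweak, sketched below, is to cap the sample size at the group size. Calling `unname()` on the list before `rbind` also keeps the original row names instead of prefixing them with the gear value. `pieces` and `k` are just names chosen for the example.

```r
set.seed(1)
n <- 10   # gear == 5 has only 5 rows, so that group is returned whole

pieces <- lapply(split(mtcars, mtcars$gear), function(x) {
  k <- min(nrow(x), n)                      # never ask for more rows than exist
  x[sample.int(nrow(x), k), , drop = FALSE]
})

sample_mtcars <- do.call(rbind, unname(pieces))
table(sample_mtcars$gear)
```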

Upvotes: 2
