Reputation: 7200

Possible way(s) to create a random sample from a data set by subsetting for dput purposes

Triggered by a comment of mine, I read the (very useful) post about how to create a reproducible example, and I think this question is closely related to it.

I have noticed that a growing number of users ask for solutions here, but the introduction is always the same, "I have a very large dataset...", and as a result they do not post any dput output at all.

So I was wondering whether there is a way to create a small sample of the data, but not just with head(<data>, n): sometimes (most times, actually) there are factors etc. that are very important for the purposes of the question, and to be useful the example data set must contain at least a few rows for each of the different factor levels in the original data. This makes the classic dput(head(data)) useless.

While browsing, I found a good solution here, which I reproduce below, but first the question:

Are there other ways to do that (of course there are)? More efficient ones? Or more "stable" ones, in the sense that the presence of all factor levels is guaranteed?

Here is the solution I have found:

 set.seed(123)
 samp_dat <- iris[sample(nrow(iris), 10, replace = FALSE), ]
 samp_dat
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
44           5.0         3.5          1.6         0.6     setosa
118          7.7         3.8          6.7         2.2  virginica
61           5.0         2.0          3.5         1.0 versicolor
130          7.2         3.0          5.8         1.6  virginica
138          6.4         3.1          5.5         1.8  virginica
7            4.6         3.4          1.4         0.3     setosa
77           6.8         2.8          4.8         1.4 versicolor
128          6.1         3.0          4.9         1.8  virginica
79           6.0         2.9          4.5         1.5 versicolor
65           5.6         2.9          3.6         1.3 versicolor
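A quick way to see whether a random subset like this one kept every level is to compare the levels of each factor column with the values actually present. The sketch below is only an illustration; `missing_levels` is a hypothetical helper name, not something from the linked solution.

```r
# Hypothetical helper: for each factor column, list the levels
# that do not occur in the sampled rows.
missing_levels <- function(sampled) {
  factor_cols <- names(sampled)[sapply(sampled, is.factor)]
  out <- lapply(sampled[factor_cols],
                function(f) setdiff(levels(f), as.character(f)))
  # keep only the columns that are actually missing something
  out[lengths(out) > 0]
}

set.seed(123)
samp_dat <- iris[sample(nrow(iris), 10), ]

# An empty list means every level of every factor column is present.
missing_levels(samp_dat)

# The first five rows of iris are all one species, so the other
# two levels are reported as missing.
missing_levels(iris[1:5, ])
```

If the list is not empty, the sample can simply be redrawn, or replaced by one of the per-level approaches.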

Edit

The solutions provided so far are very good (and I have upvoted them), and I do thank the posters, of course. But please consider this: the purpose was to create a simple sample of the original data set, so I invite you all to post solutions that are as simple as possible, because a user asking for help may not have a deep knowledge of R. For that reason I think it is better to avoid long solutions and solutions that need external packages (even though I have to admit that the dplyr one is very easy, given a little knowledge of dplyr, of course).

Upvotes: 4

Answers (2)

chappers

Reputation: 2415

If you just want to keep every factor level represented, probably the easiest way is to use dplyr and sample a fixed number of rows per level. For example:

library(dplyr)
iris %>% group_by(Species) %>% sample_n(3)

From a practical standpoint, though, you probably want to do stratified sampling, for example with caret's createDataPartition or other packages that offer more complex sampling approaches.
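For comparison, a proportional stratified sample can also be sketched in base R. This is not caret's implementation, just a minimal illustration of the idea; the helper name `stratified_rows` and its `frac` argument are made up for the example.

```r
# Hypothetical helper: draw roughly `frac` of the rows from each level of
# `strata`, but never fewer than one row per level that occurs.
stratified_rows <- function(strata, frac) {
  idx_by_level <- split(seq_along(strata), strata, drop = TRUE)
  picked <- lapply(idx_by_level, function(idx) {
    n <- max(1, round(length(idx) * frac))
    # index via sample.int() to avoid sample()'s special case for a
    # single number
    idx[sample.int(length(idx), n)]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(42)
small <- iris[stratified_rows(iris$Species, frac = 0.06), ]
table(small$Species)  # 3 rows per species, since each species has 50 rows
```

Unlike a fixed number of rows per group, this keeps the relative group sizes of the original data, which matters when the groups are unbalanced.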

Upvotes: 2

Rorschach

Reputation: 32456

Here are a couple of options that ensure every combination of factor levels is sampled:

## Factor columns
cols <- sapply(iris, is.factor)

## Using dplyr
library(dplyr)
iris %>% group_by(interaction(iris[, cols])) %>%
  sample_n(2) -> output

## Base R
do.call(rbind, lapply(split(iris, interaction(iris[, cols])), function(group)
    group[sample(nrow(group), 2),]))
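One caveat with a fixed sample size per group: `sample()` fails when a combination has fewer rows than requested, and splitting on an interaction of several factors can produce empty groups. Below is a minimal sketch that works around both and then calls dput on the result. The helper name `sample_per_combo` is hypothetical, and mtcars with two columns converted to factors is used only because iris has a single factor.

```r
# Hypothetical helper: up to `n` rows from every observed combination
# of the factor columns in `df`.
sample_per_combo <- function(df, n = 2) {
  cols <- sapply(df, is.factor)
  stopifnot(any(cols))
  # drop = TRUE discards factor combinations that never occur
  groups <- split(seq_len(nrow(df)), df[cols], drop = TRUE)
  keep <- unlist(lapply(groups, function(idx) {
    # take all rows when the group is smaller than n
    idx[sample.int(length(idx), min(n, length(idx)))]
  }), use.names = FALSE)
  df[sort(keep), , drop = FALSE]
}

set.seed(1)
# mtcars has no factor columns, so convert two for the illustration
cars <- transform(mtcars, cyl = factor(cyl), gear = factor(gear))
out <- sample_per_combo(cars, n = 2)

# droplevels() keeps unused levels out of the dput output
dput(droplevels(out[, c("mpg", "cyl", "gear")]))
```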

Upvotes: 1
