Reputation: 7200
Prompted by a comment of mine, I read the (very useful) post about how to create a reproducible example, and I think this question is closely related to it.
I have noticed that a growing number of users ask for solutions here, but the introduction is always the same: "I have a very large dataset...", and as a result they do not dput even a small piece of their data.
So I was wondering if there is a way to create a small sample of the data, but not with just head(<data>, n), because sometimes (most times, actually) there are factors etc. that are very important for the purposes of the question. To be useful, the example data set must contain at least a few rows for each of the factor levels in the original data. This makes the classic dput(head(data)) useless.
While browsing, I found a good solution, which I write down below, but first the question:
are there other ways to do this (of course there are)? More efficient ones? Or more "stable" ones, in the sense that the presence of all factor levels is guaranteed?
Here is the solution I have found:
set.seed(123)
samp_dat <- iris[sample(1:nrow(iris), 10, replace = FALSE), ]
samp_dat
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
44           5.0         3.5          1.6         0.6     setosa
118          7.7         3.8          6.7         2.2  virginica
61           5.0         2.0          3.5         1.0 versicolor
130          7.2         3.0          5.8         1.6  virginica
138          6.4         3.1          5.5         1.8  virginica
7            4.6         3.4          1.4         0.3     setosa
77           6.8         2.8          4.8         1.4 versicolor
128          6.1         3.0          4.9         1.8  virginica
79           6.0         2.9          4.5         1.5 versicolor
65           5.6         2.9          3.6         1.3 versicolor
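A plain random sample like this one happens to contain all three species, but nothing guarantees that in general. A minimal base R sketch of a check follows; the helper name covers_all_levels is hypothetical and not part of any package.

```r
# Hypothetical helper: TRUE if every (used) level of every factor column
# in `full` appears at least once in `samp`.
covers_all_levels <- function(samp, full) {
  is_fac <- vapply(full, is.factor, logical(1))
  all(vapply(names(full)[is_fac], function(col) {
    all(levels(droplevels(full[[col]])) %in% samp[[col]])
  }, logical(1)))
}

set.seed(123)
samp_dat <- iris[sample(nrow(iris), 10), ]
covers_all_levels(samp_dat, iris)    # result depends on the random draw
covers_all_levels(head(iris), iris)  # FALSE: head(iris) only contains setosa
```

If the check returns FALSE, the sample is missing at least one factor level and is probably not a good stand-in for the full data.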
The solutions provided so far are very good (I have upvoted them), and I thank the posters, of course. But please consider this: the purpose was to create a simple sample of the original data set, so I invite you all to post solutions that are as easy as possible. A user asking for help may not have a deep knowledge of R, so I think it is better to avoid long solutions and solutions that rely on external packages (even though I have to admit that the dplyr one is very easy, with a little knowledge of dplyr of course).
Upvotes: 4
Views: 223
Reputation: 2415
If you just want to keep every factor level represented, probably the easiest way is to use dplyr and sample a fixed number of rows from each group. For example:
library(dplyr)
iris %>% group_by(Species) %>% sample_n(3)
From a practical standpoint, though, you probably want stratified sampling, for example with caret's createDataPartition or another package that offers more complex sampling approaches.
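For readers who prefer to avoid extra packages, proportional stratified sampling can also be sketched in base R. This is not caret's implementation, only an illustration of the idea; the helper name stratified_rows and its p argument are assumptions made for this example.

```r
# Hypothetical helper: pick roughly a fraction `p` of the rows from each
# level of `strata`, but never fewer than one row per level.
stratified_rows <- function(strata, p) {
  idx_by_level <- split(seq_along(strata), droplevels(as.factor(strata)))
  picked <- lapply(idx_by_level, function(idx) {
    n <- max(1, round(length(idx) * p))
    idx[sample.int(length(idx), n)]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(42)
samp <- iris[stratified_rows(iris$Species, p = 0.1), ]
table(samp$Species)  # 5 rows per species (10% of 50 each)
```

Because the sample size scales with each level's size, rare levels stay rare in the sample but are never dropped entirely.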
Upvotes: 2
Reputation: 32456
Here are a couple of options that ensure every combination of factor levels is sampled:
## Factor columns (is.factor also handles ordered factors)
cols <- sapply(iris, is.factor)
## Using dplyr
library(dplyr)
output <- iris %>%
  group_by(grp = interaction(iris[cols], drop = TRUE)) %>%
  sample_n(2)
## Base R (drop empty combinations; take at most 2 rows per group)
do.call(rbind, lapply(split(iris, interaction(iris[cols], drop = TRUE)), function(group)
  group[sample(nrow(group), min(2, nrow(group))), ]))
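To turn the base R approach into something that can be pasted straight into a question, it can be wrapped in a small function and passed to dput. This is only a sketch; the name sample_per_group and its n argument are hypothetical.

```r
# Hypothetical helper: up to `n` random rows per combination of factor levels.
sample_per_group <- function(data, n = 2) {
  is_fac <- vapply(data, is.factor, logical(1))
  if (!any(is_fac)) {
    # No factor columns: fall back to a plain random sample of rows
    return(data[sample.int(nrow(data), min(n, nrow(data))), , drop = FALSE])
  }
  groups <- split(data, interaction(data[is_fac], drop = TRUE))
  out <- do.call(rbind, lapply(groups, function(g) {
    g[sample.int(nrow(g), min(n, nrow(g))), , drop = FALSE]
  }))
  droplevels(out)  # keep the dput output free of unused levels
}

set.seed(1)
small <- sample_per_group(iris, n = 2)
dput(small)  # paste this output into the question
```

Calling droplevels before dput matters mostly for subsets of larger data, where unused levels would otherwise bloat the printed structure.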
Upvotes: 1