I have a data frame that I want to subset by randomly selecting 25 values of ID where spp == cat and 25 values of ID where spp == dog.
Here is my example data:
ID spp category prop
1 cat small_mam 0.99
2 cat small_mam 0.8
2 cat birds 0.15
3 dog large_mam 1
4 dog med_mam 0.4
4 dog emu 0.6
10 dog med_mam 0.8
10 dog birds 0.2
12 dog reptiles 1
13 dog large_mam 1
14 dog large_mam 1
15 dog large_mam 1
27 cat birds 0.2
28 cat small_mam 1
29 cat small_mam 0.75
29 cat birds 0.25
30 cat small_mam 0.7
30 cat birds 0.2
ID values are unique to a species, meaning that cat and dog never share an ID value. ID ranges from 1 to 696 but is not unique across rows: a single ID can span up to 7 rows, one per category, so randomly subsetting 25 rows for each species does not work.
The context behind this question is that I will be drawing 1000 random samples of 25 cat and 25 dog scats (UID = the scat ID number) for a bootstrap calculation of dietary overlap using the piankabio function from the pgirmess package.
Thanks in advance for any help.
I am using R version 3.1.3
Upvotes: 3
Views: 149
Reputation: 83275
With data.table you could do it as follows:
library(data.table)
subdf <- setDT(mydf)[, sample(ID, 5), by = spp]
On the example data you provided this gives:
> subdf
    spp V1
 1: cat 27
 2: cat 30
 3: cat  2
 4: cat 28
 5: cat 30
 6: dog 10
 7: dog 14
 8: dog 12
 9: dog  4
10: dog 15
When you want to keep all columns (which I suppose you want to), you can do:
subdf <- setDT(mydf)[, .SD[sample(.N, 5)], by = spp]
which gives:
> subdf
    spp ID  category prop
 1: cat 29 small_mam 0.75
 2: cat  1 small_mam 0.99
 3: cat  2     birds 0.15
 4: cat 30 small_mam 0.70
 5: cat 28 small_mam 1.00
 6: dog 14 large_mam 1.00
 7: dog 15 large_mam 1.00
 8: dog 13 large_mam 1.00
 9: dog 10     birds 0.20
10: dog  4   med_mam 0.40
Note: I used a sample of 5 for explanatory reasons as the example dataset is not large enough to draw a sample of 25.
In response to your comment: to sample IDs rather than rows, and keep every row belonging to each sampled ID, you can use:
setDT(mydf)
set.seed(4321)
newdf <- mydf[mydf[, .(ID = sample(unique(ID), 5)), by = spp], on = c("spp", "ID")]
which gives:
> newdf
    ID spp  category prop
 1: 27 cat     birds 0.20
 2: 29 cat small_mam 0.75
 3: 29 cat     birds 0.25
 4:  2 cat small_mam 0.80
 5:  2 cat     birds 0.15
 6:  1 cat small_mam 0.99
 7: 28 cat small_mam 1.00
 8: 14 dog large_mam 1.00
 9: 13 dog large_mam 1.00
10: 15 dog large_mam 1.00
11:  4 dog   med_mam 0.40
12:  4 dog       emu 0.60
13: 12 dog  reptiles 1.00
Explanation: With mydf[, .(ID = sample(unique(ID), 5)), by = spp] you create an index data.table with 5 unique IDs for each level of spp. You then join on spp and ID, using this index data.table to select the rows of mydf that belong to these IDs.
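Since the question mentions drawing 1000 such samples, here is a minimal sketch of how the ID-level draw above could be repeated. The helper name sample_ids is hypothetical (it is not part of data.table), the sample size stays at 5 because the example data is too small for 25, and the overlap calculation itself is left out, so treat this as an outline rather than a finished bootstrap.

```r
library(data.table)

# Example data from the question
mydf <- data.table(
  ID = c(1, 2, 2, 3, 4, 4, 10, 10, 12, 13, 14, 15, 27, 28, 29, 29, 30, 30),
  spp = c(rep("cat", 3), rep("dog", 9), rep("cat", 6)),
  category = c("small_mam", "small_mam", "birds", "large_mam", "med_mam",
               "emu", "med_mam", "birds", "reptiles", "large_mam",
               "large_mam", "large_mam", "birds", "small_mam", "small_mam",
               "birds", "small_mam", "birds"),
  prop = c(0.99, 0.8, 0.15, 1, 0.4, 0.6, 0.8, 0.2, 1, 1, 1, 1,
           0.2, 1, 0.75, 0.25, 0.7, 0.2)
)

# Hypothetical helper: draw n distinct IDs per species (without replacement)
# and return every row of dt that belongs to one of the drawn IDs.
sample_ids <- function(dt, n) {
  idx <- dt[, .(ID = sample(unique(ID), n)), by = spp]
  dt[idx, on = c("spp", "ID")]
}

set.seed(4321)
# A list of 1000 data.tables, one per random draw
boot <- replicate(1000, sample_ids(mydf, 5), simplify = FALSE)
```

Each element of boot can then be passed to whatever overlap calculation you need. With the full dataset, replace 5 by 25.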
Upvotes: 6