I have read many posts on the site about randomly subsetting a large dataset for observations based on date -- for the first, last, or a specific date. However, I have a different challenge that requires me to subsample a large dataset by site AND date. I want to keep all sites in the subsetted dataset, but only include 1 date observation per site.
More specifically, I have a large dataset (for community ecology!) of insect community observations (n=2000) across 4 years. They were observed at ~900 sites, but each site has between 1 and 6 date observations within a year, with no sites repeated between years (this is why previous posts looking to subset a specific date range do not apply here). Subsetting in this particular way is critical because of the type of statistical analysis I am using: including spatial autocorrelation terms in the analysis means that I can only include one observation per site.
So the full dataset looks something like:
Site Date Ladybug
Baumgarten 6/24/2014 2
Baumgarten 8/6/2014 0
Baumgarten 8/20/2014 3
Baumgarten 7/8/2014 0
Baumgarten 7/22/2014 1
Berkevich 7/9/2014 0
Berkevich 7/23/2014 4
Berkevich 8/8/2014 0
Berkevich 8/22/2014 0
Boehm 6/24/2014 2
# dput(data)
dd <- structure(list(Site = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L), .Label = c("Baumgarten", "Berkevich", "Boehm"), class = "factor"), Date = structure(c(1L, 8L, 6L, 4L, 2L, 5L, 3L, 9L, 7L, 1L), .Label = c("6/24/2014", "7/22/2014", "7/23/2014", "7/8/2014", "7/9/2014", "8/20/2014", "8/22/2014", "8/6/2014", "8/8/2014" ), class = "factor"), Ladybug = c(2L, 0L, 3L, 0L, 1L, 0L, 4L, 0L, 0L, 2L)), .Names = c("Site", "Date", "Ladybug"), class = "data.frame", row.names = c(NA, -10L))
And my desired subsetted dataset would look something like:
Site Date Ladybug
Baumgarten 8/20/2014 3
Berkevich 7/9/2014 0
Boehm 6/24/2014 2
I have dates entered in both MM/DD/YYYY and DOY format (since sites don't repeat between years, DOY x site subsetting will still work to ensure no repeating sites), so code that uses either could work.
Any advice would be much appreciated. Thanks.
Upvotes: 3
Views: 90
Reputation: 402
A possibly inefficient method, but it gets the job done.
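sites <- unique(data$Site)
rowselect <- sapply(seq_along(sites), function(x) {
  # row indices belonging to the x-th site
  elem <- which(data$Site == sites[x])
  if (length(elem) < 2) {
    # single row: return it directly, since sample(elem, 1) would draw from 1:elem
    return(elem)
  } else {
    return(sample(elem, 1))
  }
})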
This gives the row index of one randomly selected row for each site; data[rowselect, ] then returns the subsetted rows.
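The same per-site row-index idea can be sketched more compactly with tapply. This is only an illustration on the sample rows from the question: the names dd and subsampled and the seed value are assumptions, and dates are kept as plain character strings for brevity. Using sample.int(length(i), 1) sidesteps the single-row special case, because it always draws a position within the site's own index vector.

```r
# Sample rows from the question (dates kept as character strings).
dd <- data.frame(
  Site = c(rep("Baumgarten", 5), rep("Berkevich", 4), "Boehm"),
  Date = c("6/24/2014", "8/6/2014", "8/20/2014", "7/8/2014", "7/22/2014",
           "7/9/2014", "7/23/2014", "8/8/2014", "8/22/2014", "6/24/2014"),
  Ladybug = c(2L, 0L, 3L, 0L, 1L, 0L, 4L, 0L, 0L, 2L),
  stringsAsFactors = FALSE
)

set.seed(42)  # arbitrary seed, only so the draw is repeatable

# For each site, pick one position among that site's row indices.
# sample.int(length(i), 1) is safe even when a site has a single row.
rowselect <- tapply(seq_len(nrow(dd)), dd$Site,
                    function(i) i[sample.int(length(i), 1)])

# Drop the array attributes before indexing the data frame.
subsampled <- dd[as.vector(rowselect), ]
```

Every site appears exactly once in subsampled, and a site with a single observation (Boehm here) always keeps that observation.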
Upvotes: 0
Reputation: 21497
Using data.table, you can use:
library(data.table)
setDT(DT)  # convert the data.frame DT to a data.table by reference
DT[, .SD[sample(.N, 1)], by = Site]  # one random row per Site
This gives you
Site Date Ladybug
1: Baumgarten 8/20/2014 3
2: Berkevich 7/23/2014 4
3: Boehm 6/24/2014 2
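Because the rows are drawn at random, the output changes from run to run unless a seed is set first. The following is a sketch of a variant that samples row numbers with .I and then subsets once, which is a common data.table idiom for large tables. The object names idx and res and the seed value are assumptions made for illustration.

```r
library(data.table)

# Sample rows from the question.
DT <- data.table(
  Site = c(rep("Baumgarten", 5), rep("Berkevich", 4), "Boehm"),
  Date = c("6/24/2014", "8/6/2014", "8/20/2014", "7/8/2014", "7/22/2014",
           "7/9/2014", "7/23/2014", "8/8/2014", "8/22/2014", "6/24/2014"),
  Ladybug = c(2L, 0L, 3L, 0L, 1L, 0L, 4L, 0L, 0L, 2L)
)

set.seed(1)  # arbitrary seed, only so the draw is repeatable

# .I holds the row numbers of the current group within DT;
# pick one of them per Site, then subset DT a single time.
idx <- DT[, .I[sample.int(.N, 1)], by = Site]$V1
res <- DT[idx]
```

The result has one row per Site, in order of each site's first appearance in DT.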
Upvotes: 2
Reputation: 13149
You could also use base R for this. The code splits the data by site, samples one row from each piece, and then binds the results back together.
set.seed(123)
res <- do.call(rbind, lapply(split(dat, dat$Site),
                             function(x) x[sample(nrow(x), 1), ]))
Another possibility is data.table:
library(data.table)
setDT(dat)
set.seed(123)
res <- dat[,.SD[sample(.N,1)],Site]
Upvotes: 1
Reputation: 20463
Assuming your data is a data.frame named df, you could use dplyr and do the following:
library(dplyr)
df %>%
group_by(Site) %>%
sample_n(1)
# Source: local data frame [3 x 3]
# Groups: Site [3]
#
# Site Date Ladybug
# (fctr) (fctr) (int)
# 1 Baumgarten 8/20/2014 3
# 2 Berkevich 8/22/2014 0
# 3 Boehm 6/24/2014 2
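In dplyr 1.0.0 and later, sample_n() is superseded by slice_sample(), so the same approach can be sketched as below. The seed value and the name res are assumptions made for illustration, and the data frame is rebuilt from the question's sample rows so the snippet runs on its own.

```r
library(dplyr)

# Sample rows from the question.
df <- data.frame(
  Site = c(rep("Baumgarten", 5), rep("Berkevich", 4), "Boehm"),
  Date = c("6/24/2014", "8/6/2014", "8/20/2014", "7/8/2014", "7/22/2014",
           "7/9/2014", "7/23/2014", "8/8/2014", "8/22/2014", "6/24/2014"),
  Ladybug = c(2L, 0L, 3L, 0L, 1L, 0L, 4L, 0L, 0L, 2L)
)

set.seed(1)  # arbitrary seed, only so the draw is repeatable

res <- df %>%
  group_by(Site) %>%
  slice_sample(n = 1) %>%  # one random row within each Site
  ungroup()                # drop the grouping before further analysis
```

Calling ungroup() at the end avoids carrying the Site grouping into later steps.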
Upvotes: 3