user5655337

Randomly subsetting 1 observation per site and date

I have read many posts on the site about randomly subsetting a large dataset for observations based on date -- for the first, last, or a specific date. However, I have a different challenge that requires me to subsample a large dataset by site AND date. I want to keep all sites in the subsetted dataset, but only include 1 date observation per site.

More specifically, I have a large dataset (for community ecology!) of insect community observations (n=2000) across 4 years. They were observed from ~900 sites, but each site has between 1 and 6 date observations within a year, with no sites repeated between years (this is why previous posts looking to subset a specific date range cannot apply here). Subsetting in this particular way is critical because of the type of statistical analysis I am using: including spatial autocorrelation terms in the analysis means that I can only include one observation per site.

So the full dataset looks something like:

Site        Date        Ladybug
Baumgarten  6/24/2014   2
Baumgarten  8/6/2014    0
Baumgarten  8/20/2014   3
Baumgarten  7/8/2014    0
Baumgarten  7/22/2014   1
Berkevich   7/9/2014    0
Berkevich   7/23/2014   4
Berkevich   8/8/2014    0
Berkevich   8/22/2014   0
Boehm       6/24/2014   2

# dput(data)
dd <- structure(list(Site = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L), .Label = c("Baumgarten", "Berkevich", "Boehm"), class = "factor"),  Date = structure(c(1L, 8L, 6L, 4L, 2L, 5L, 3L, 9L, 7L, 1L), .Label = c("6/24/2014", "7/22/2014", "7/23/2014", "7/8/2014",  "7/9/2014", "8/20/2014", "8/22/2014", "8/6/2014", "8/8/2014" ), class = "factor"), Ladybug = c(2L, 0L, 3L, 0L, 1L, 0L,  4L, 0L, 0L, 2L)), .Names = c("Site", "Date", "Ladybug"), class = "data.frame", row.names = c(NA,  -10L))

And my desired subsetted dataset would look something like:

Site        Date        Ladybug
Baumgarten  8/20/2014   3
Berkevich   7/9/2014    0
Boehm       6/24/2014   2

I have dates entered in both MM/DD/YYYY and DOY format (since sites don't repeat between years, DOY x site subsetting will still work to ensure no repeating sites), so code that uses either could work.

Any advice would be much appreciated. Thanks.

Upvotes: 3

Views: 90

Answers (4)

Jonas Coussement

Reputation: 402

A possibly inefficient method, but it gets the job done.

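# dd is the data frame from the question
sites <- unique(dd$Site)
rowselect <- sapply(seq_along(sites), function(x) {
  elem <- which(dd$Site == sites[x])   # row indices belonging to this site
  if (length(elem) < 2) {
    return(elem)                       # sample() on a single number would draw from 1:elem
  } else {
    return(sample(elem, 1))
  }
})

This gives the row index of one randomly selected row for each site; dd[rowselect, ] then returns the subsetted data.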
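
The same idea can probably be written without the explicit single-row check. This is only a sketch: the toy data frame below is an assumed stand-in that mirrors the example in the question, and the name subsample is made up for illustration. It uses tapply() to group the row numbers by site and sample.int() to pick one position within each group, which also works when a site has only one row.

```r
# Toy data mirroring the question's example (an assumed stand-in for the real dataset)
dd <- data.frame(
  Site = c(rep("Baumgarten", 5), rep("Berkevich", 4), "Boehm"),
  Date = c("6/24/2014", "8/6/2014", "8/20/2014", "7/8/2014", "7/22/2014",
           "7/9/2014", "7/23/2014", "8/8/2014", "8/22/2014", "6/24/2014"),
  Ladybug = c(2L, 0L, 3L, 0L, 1L, 0L, 4L, 0L, 0L, 2L),
  stringsAsFactors = FALSE
)

set.seed(1)  # only so that the draw is repeatable

# For each site, pick one of its row numbers at random.
# sample.int(length(i), 1) picks a position, so a one-row site is handled too.
rowselect <- tapply(seq_len(nrow(dd)), dd$Site,
                    function(i) i[sample.int(length(i), 1)])

# Keep exactly one row per site
subsample <- dd[as.vector(rowselect), ]
```

Every site is kept and each appears once in subsample; which date is chosen depends on the seed.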

Upvotes: 0

Rentrop

Reputation: 21497

With data.table you can use:

library(data.table)
DT <- as.data.table(dd)  # dd is the data frame from the question
DT[, .SD[sample(.N, 1)], by = Site]

This gives you, for example (the sampled rows change from run to run):

         Site      Date Ladybug
1: Baumgarten 8/20/2014       3
2:  Berkevich 7/23/2014       4
3:      Boehm 6/24/2014       2
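
With ~2000 rows the .SD version is fine. For much larger tables, a commonly used variant collects row numbers with .I first and subsets once at the end. The sketch below assumes a toy table that mirrors the question's example; idx and res are names chosen only for illustration.

```r
library(data.table)

# Toy table mirroring the question's example (an assumed stand-in for the real dataset)
DT <- data.table(
  Site = c(rep("Baumgarten", 5), rep("Berkevich", 4), "Boehm"),
  Date = c("6/24/2014", "8/6/2014", "8/20/2014", "7/8/2014", "7/22/2014",
           "7/9/2014", "7/23/2014", "8/8/2014", "8/22/2014", "6/24/2014"),
  Ladybug = c(2L, 0L, 3L, 0L, 1L, 0L, 4L, 0L, 0L, 2L)
)

set.seed(42)  # only so that the draw is repeatable

# .I holds the row numbers of the current group; keep one of them per site
idx <- DT[, .I[sample.int(.N, 1)], by = Site]$V1

# Subset the original table once, using the collected row numbers
res <- DT[idx]
```

The result has the same shape as the .SD version: one row per site, with all columns kept.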

Upvotes: 2

Heroka

Reputation: 13149

You could also use base R for this, assuming your data frame is called dat: split the data by site, sample one row from each piece, and bind the results back together.

set.seed(123)

res <- do.call(rbind,lapply(split(dat,dat$Site),function(x){x[sample(nrow(x),1),]}))

Another possibility is data.table:

library(data.table)
setDT(dat)
set.seed(123)
res <- dat[,.SD[sample(.N,1)],Site]
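
The set.seed() calls are there so the subsample can be reproduced later. As a rough check, the sketch below wraps the base-R approach in a helper (pick_one is a hypothetical name, and the toy data is an assumed stand-in that mirrors the question's example) and draws twice with the same seed.

```r
# Toy data mirroring the question's example (an assumed stand-in for the real dataset)
dat <- data.frame(
  Site = c(rep("Baumgarten", 5), rep("Berkevich", 4), "Boehm"),
  Date = c("6/24/2014", "8/6/2014", "8/20/2014", "7/8/2014", "7/22/2014",
           "7/9/2014", "7/23/2014", "8/8/2014", "8/22/2014", "6/24/2014"),
  Ladybug = c(2L, 0L, 3L, 0L, 1L, 0L, 4L, 0L, 0L, 2L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split by site, draw one row from each piece, bind back together
pick_one <- function(d) {
  do.call(rbind, lapply(split(d, d$Site), function(x) x[sample(nrow(x), 1), ]))
}

set.seed(123)
res1 <- pick_one(dat)

set.seed(123)
res2 <- pick_one(dat)

# The same seed reproduces the same rows
same_draw <- identical(res1, res2)
```

Recording the seed alongside the analysis makes it possible to regenerate the exact subsample that was fed into the spatial model.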

Upvotes: 1

JasonAizkalns

Reputation: 20463

Assuming your data is a data.frame named df, you could use dplyr and do the following:

library(dplyr)

df %>%
  group_by(Site) %>%
  sample_n(1)

# Source: local data frame [3 x 3]
# Groups: Site [3]
#  
#         Site      Date Ladybug
#       (fctr)    (fctr)   (int)
# 1 Baumgarten 8/20/2014       3
# 2  Berkevich 8/22/2014       0
# 3      Boehm 6/24/2014       2
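
In dplyr 1.0.0 and later, sample_n() is superseded by slice_sample(), so on a current installation the equivalent would look roughly like this. The toy data frame is an assumed stand-in that mirrors the question's example, and res is just an illustrative name.

```r
library(dplyr)

# Toy data mirroring the question's example (an assumed stand-in for the real dataset)
df <- data.frame(
  Site = c(rep("Baumgarten", 5), rep("Berkevich", 4), "Boehm"),
  Date = c("6/24/2014", "8/6/2014", "8/20/2014", "7/8/2014", "7/22/2014",
           "7/9/2014", "7/23/2014", "8/8/2014", "8/22/2014", "6/24/2014"),
  Ladybug = c(2L, 0L, 3L, 0L, 1L, 0L, 4L, 0L, 0L, 2L),
  stringsAsFactors = FALSE
)

set.seed(2024)  # only so that the draw is repeatable

res <- df %>%
  group_by(Site) %>%        # sample within each site
  slice_sample(n = 1) %>%   # keep one random row per group
  ungroup()                 # drop the grouping before further analysis
```

As with sample_n(), the row picked for each site changes between runs unless a seed is set.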

Upvotes: 3
