kiyoshi sasaki

Reputation: 35

Subset data by randomly sampling one Site per Region but keeping rows of different Year

This is an extension of the question "Randomly sample per group, make a new dataframe, repeat until all entities within a group are sampled".

From the example data below, I want to produce multiple data frames by randomly sampling one Site from every Region. To make the next data frame, I take another random sample of Site without replacement; that is, a Site of a given Region that was sampled in any previous round cannot be sampled again. So there will be as many data frames as the number of sites within regions. This part of my question was answered in the link above (although I could not find a check mark to accept that answer on that website).

My question here concerns another data frame of mine that has data from multiple years for a given site. I want each data frame to contain unique Region-Site combinations (answered in the link above) while keeping the rows from all years. Here is some example data (the number of years and sites differs somewhat between regions):

mydf <- read.table(header = TRUE, text = 'V1 V2 Region Site Year
  5 1 A X1 2000
  1 1 A X1 2001
  5 6 A X2 2000
  2 2 A X2 2001
  8 9 A X3 2000
  5 5 A X3 2001
  3 3 B X1 2000
  2 3 B X1 2001
  3 1 B X2 2000
  4 4 B X2 2001
  7 8 B X3 2000
  1 2 C X1 2000
  9 4 C X1 2001
  4 5 C X2 2000
  6 7 C X2 2001')

Here are some expected data frames:

V1 V2 Region Site Year
5  1      A   X1 2000
1  1      A   X1 2001
3  1      B   X2 2000
4  4      B   X2 2001
1  2      C   X1 2000
9  4      C   X1 2001

V1 V2 Region Site Year
8  9      A   X3 2000
5  5      A   X3 2001
3  3      B   X1 2000
2  3      B   X1 2001
4  5      C   X2 2000
6  7      C   X2 2001

I tried to modify code provided in the link above, but it did not work. Here is the code I tried

library(data.table)
dt <- setDT(mydf)
dt <- dt[sample(.N)]
dt <- unique(dt, by = c('Year','Region'))
dt[, .SD[1], by=c("Region","Year")]

Upvotes: 1

Views: 52

Answers (1)

akrun

Reputation: 887301

Since there are no duplicate 'Year' values within any 'Region'/'Site' combination, convert to a 'data.table' (setDT(mydf)), group by 'Region', sample one of the unique 'Site' values, get the row index (.I) of the rows whose 'Site' equals the sampled value, extract that index ($V1), and use it to subset the rows of the dataset:

setDT(mydf)[mydf[, .I[Site == sample(unique(Site), 1)], by = .(Region)]$V1]
#   V1 V2 Region Site Year
#1:  5  1      A   X1 2000
#2:  1  1      A   X1 2001
#3:  3  1      B   X2 2000
#4:  4  4      B   X2 2001
#5:  1  2      C   X1 2000
#6:  9  4      C   X1 2001

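To repeat the draw several times, use replicate. Each draw here is independent of the others, so the same 'Site' can be picked more than once across the list: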

setDT(mydf)
lst <- replicate(5, mydf[mydf[,  .I[Site ==sample(unique(Site), 1)],
                .(Region)]$V1], simplify = FALSE)
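
The nested index expression used in the two snippets above can be easier to follow when split into steps. The following is only a sketch of the same idea: the names idx and one_draw are illustrative, and the fixed seed is an assumption added so that the draw is repeatable.

```r
library(data.table)

mydf <- read.table(header = TRUE, text = 'V1 V2 Region Site Year
  5 1 A X1 2000
  1 1 A X1 2001
  5 6 A X2 2000
  2 2 A X2 2001
  8 9 A X3 2000
  5 5 A X3 2001
  3 3 B X1 2000
  2 3 B X1 2001
  3 1 B X2 2000
  4 4 B X2 2001
  7 8 B X3 2000
  1 2 C X1 2000
  9 4 C X1 2001
  4 5 C X2 2000
  6 7 C X2 2001')
setDT(mydf)

set.seed(42)  # assumption: fixed seed, only to make the draw repeatable

# Step 1: within each Region, pick one Site at random and return the
# row numbers (.I) of every row belonging to that Site.
# The unnamed result column is called V1 by data.table.
idx <- mydf[, .I[Site == sample(unique(Site), 1)], by = Region]

# Step 2: subset the original table with those row numbers.
# All years of the chosen Site are kept.
one_draw <- mydf[idx$V1]

# Sanity check: each Region should contribute exactly one Site
one_draw[, .(n_sites = uniqueN(Site)), by = Region]
```

Splitting the call this way also makes it possible to inspect idx and see which rows were selected for each Region before subsetting.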

Update

If a 'Site' that has already been drawn must not be drawn again, use a for loop: on each iteration, store the rows of the sampled 'Site' per 'Region' in a list of data.tables ('lst1'), then update the dataset so that it keeps only the rows that have not been sampled yet.

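setDT(mydf)
mydf1 <- copy(mydf)        # untouched copy of the data; the loop shrinks mydf
lst1 <- vector("list", 3)  # 3 = largest number of Sites in any Region
for (i in 1:3) {
  # draw one not-yet-sampled Site per Region
  tmp <- mydf[, .(Site = sample(unique(Site), 1)), by = Region]
  # keep all rows (all years) of the drawn Region/Site pairs
  lst1[[i]] <- mydf[tmp, on = .(Region, Site)]
  # anti-join: drop the drawn pairs; unlike a positional logical index,
  # this does not depend on the row order of mydf
  mydf <- mydf[!tmp, on = .(Region, Site)]
}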

lst1
#[[1]]
#   V1 V2 Region Site Year
#1:  5  6      A   X2 2000
#2:  2  2      A   X2 2001
#3:  3  3      B   X1 2000
#4:  2  3      B   X1 2001
#5:  4  5      C   X2 2000
#6:  6  7      C   X2 2001

#[[2]]
#   V1 V2 Region Site Year
#1:  5  1      A   X1 2000
#2:  1  1      A   X1 2001
#3:  7  8      B   X3 2000
#4:  1  2      C   X1 2000
#5:  9  4      C   X1 2001

#[[3]]
#   V1 V2 Region Site Year
#1:  8  9      A   X3 2000
#2:  5  5      A   X3 2001
#3:  3  1      B   X2 2000
#4:  4  4      B   X2 2001
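
As an alternative to the loop, sampling without replacement can also be expressed as a single shuffle of the Sites within each Region, followed by a split. This is a different technique from the loop in this answer, offered only as a sketch: the names sites, draw and lst2 are illustrative and do not appear in the original code.

```r
library(data.table)

mydf <- read.table(header = TRUE, text = 'V1 V2 Region Site Year
  5 1 A X1 2000
  1 1 A X1 2001
  5 6 A X2 2000
  2 2 A X2 2001
  8 9 A X3 2000
  5 5 A X3 2001
  3 3 B X1 2000
  2 3 B X1 2001
  3 1 B X2 2000
  4 4 B X2 2001
  7 8 B X3 2000
  1 2 C X1 2000
  9 4 C X1 2001
  4 5 C X2 2000
  6 7 C X2 2001')
setDT(mydf)

# One row per Region/Site combination
sites <- unique(mydf[, .(Region, Site)])

# Shuffle the Sites within each Region: `draw` is the number of the
# sample in which a Site ends up (a random permutation of 1..N per Region)
sites[, draw := sample(.N), by = Region]

# Attach the draw number to every row (all years), then split on it
lst2 <- split(mydf[sites, on = .(Region, Site)], by = "draw", sorted = TRUE)

# Each list element holds at most one Site per Region
lapply(lst2, function(d) d[, .(n_sites = uniqueN(Site)), by = Region])
```

Because every Site receives exactly one draw number, no Site can appear in two list elements, and the number of elements equals the largest number of Sites in any Region without having to hard-code it. Regions with fewer Sites (here C) are simply absent from the later elements, as in the loop output above.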

Upvotes: 1
