R: Creating multiple resampled dataset based on multiple factors

Question

I need to create several thousand resampled datasets from a large database. I have three categorical variables: Site (S), Transect (T), and Quadrat (Q). The response variable is Value (V), which is the result of a particular S, T, and Q combination. Quadrats are located along each transect at each site. I pasted an abbreviated dataset below.

S   T   Q   V
A   1   1   8
A   1   2   5
A   1   3   0
A   2   1   0
A   2   2   15
A   2   3   0
A   3   1   0
A   3   2   25
A   3   3   0
B   1   1   0
B   1   2   1
B   1   3   0
B   2   1   33
B   2   2   1
B   2   3   2
B   3   1   0
B   3   2   207
B   3   3   0
C   1   1   0
C   1   2   1
C   1   3   0
C   2   1   45
C   2   2   33
C   2   3   0
C   3   1   0
C   3   2   1
C   3   3   0

The idea is that, for a given site, the resampled dataset would contain ## quadrats from each of transects 1 to n, where ## is the number of quadrats (Q) per transect (T) per site (S). I am not trying to resample the dataset based on S, T, & Q. I would like to be able to resample a user-defined number of rows, based on the conditions I define. For example, if I chose to resample 2 quadrats (Q) per transect (T) per site (S), I envision the resampled dataset looking like the example below.

S   T   Q   V
A   1   1   8
A   1   3   0
A   2   1   0
A   2   2   15
A   3   2   25
A   3   3   0
B   1   2   1
B   1   3   0
B   2   2   1
B   2   3   2
B   3   1   0
B   3   2   207
C   1   1   0
C   1   3   0
C   2   1   45
C   2   3   0
C   3   2   1
C   3   3   0

Please let me know if that doesn't make sense and I'll revise until it does. Thanks for any assistance!

Parfait · Accepted Answer

Consider by() to slice the data frame by the Site and Transect factors, then sample random rows from each slice:

set.seed(444)
quads <- 2

# BUILD LIST OF SUBSETTED RANDOM SAMPLED DATAFRAMES 
df_list <- by(df, df[c("S", "T")], FUN=function(df) df[sample(nrow(df), quads),])

# STACK ALL DATAFRAMES INTO ONE FINAL DF
sample_df <- do.call(rbind, df_list)

# SORT DATAFRAME BY S AND T    
sample_df <- with(sample_df, sample_df[order(S, T),])

# RESET ROW NAMES
row.names(sample_df) <- NULL

sample_df
#    S T Q   V
# 1  A 1 1   8
# 2  A 1 3   0
# 3  A 2 2  15
# 4  A 2 1   0
# 5  A 3 1   0
# 6  A 3 3   0
# 7  B 1 2   1
# 8  B 1 1   0
# 9  B 2 3   2
# 10 B 2 1  33
# 11 B 3 1   0
# 12 B 3 2 207
# 13 C 1 1   0
# 14 C 1 2   1
# 15 C 2 1  45
# 16 C 2 3   0
# 17 C 3 3   0
# 18 C 3 2   1
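
As a quick sanity check on a result like the one above, the sketch below rebuilds the example data and confirms that each Site-by-Transect group contributes exactly quads rows, with no quadrat drawn twice. The helper name sample_quads is hypothetical and not part of the answer; it only wraps the same by/sample/rbind steps.

```r
# Rebuild the example data (Q varies fastest, then T, then S)
df <- expand.grid(Q = 1:3, T = 1:3, S = c("A", "B", "C"),
                  stringsAsFactors = FALSE)
df$V <- c(8, 5, 0, 0, 15, 0, 0, 25, 0,
          0, 1, 0, 33, 1, 2, 0, 207, 0,
          0, 1, 0, 45, 33, 0, 0, 1, 0)
df <- df[c("S", "T", "Q", "V")]

# Hypothetical helper wrapping the by/sample/rbind steps
sample_quads <- function(data, quads) {
  pieces <- by(data, data[c("S", "T")],
               FUN = function(d) d[sample(nrow(d), quads), ])
  out <- do.call(rbind, pieces)
  out <- out[order(out$S, out$T), ]
  row.names(out) <- NULL
  out
}

set.seed(444)
sample_df <- sample_quads(df, quads = 2)

# Rows per Site x Transect group: every cell should equal quads
counts <- table(sample_df$S, sample_df$T)
print(counts)

# sample() draws without replacement by default,
# so no quadrat should repeat within a group
any_dup <- any(duplicated(sample_df[c("S", "T", "Q")]))
```

If any cell of counts differs from quads, or any_dup is TRUE, the grouping columns or the sampling call are probably not doing what was intended.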

Data

txt = '
S   T   Q   V
A   1   1   8
A   1   2   5
A   1   3   0
A   2   1   0
A   2   2   15
A   2   3   0
A   3   1   0
A   3   2   25
A   3   3   0
B   1   1   0
B   1   2   1
B   1   3   0
B   2   1   33
B   2   2   1
B   2   3   2
B   3   1   0
B   3   2   207
B   3   3   0
C   1   1   0
C   1   2   1
C   1   3   0
C   2   1   45
C   2   2   33
C   2   3   0
C   3   1   0
C   3   2   1
C   3   3   0'

df = read.table(text=txt, header=TRUE)

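To build many randomly sampled data frames, turn quads into a vector (here, 1000 random draws between 1 and max_quads) and run it through lapply, which returns one sampled data frame per element: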

max_quads <- 3
quads <- replicate(1000, sample(1:max_quads, 1))

df_list <- lapply(quads, function(q) {

  by_list <- by(df, df[c("S", "T")], FUN=function(df) df[sample(nrow(df), q),])
  sample_df <- do.call(rbind, by_list)

  sample_df <- with(sample_df, sample_df[order(S, T),])
  row.names(sample_df) <- NULL

  return(sample_df)

})
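
If every resampled dataset should use the same number of quadrats per transect, as the question seems to describe, quads can be held constant instead of drawn at random. The following is a minimal sketch under that assumption; n_reps, resample_once, and the rep column are illustrative names that do not appear in the answer, and only 5 replicates are drawn to keep the example small.

```r
# Rebuild the example data (Q varies fastest, then T, then S)
df <- expand.grid(Q = 1:3, T = 1:3, S = c("A", "B", "C"),
                  stringsAsFactors = FALSE)
df$V <- c(8, 5, 0, 0, 15, 0, 0, 25, 0,
          0, 1, 0, 33, 1, 2, 0, 207, 0,
          0, 1, 0, 45, 33, 0, 0, 1, 0)
df <- df[c("S", "T", "Q", "V")]

n_reps <- 5   # illustrative; the question asks for several thousand
quads  <- 2   # fixed number of quadrats per transect per site

# Hypothetical helper: one resampled dataset with a fixed quads value
resample_once <- function(data, quads) {
  pieces <- by(data, data[c("S", "T")], FUN = function(d) {
    # Guard: fail early if a transect has fewer quadrats than requested
    stopifnot(nrow(d) >= quads)
    d[sample(nrow(d), quads), ]
  })
  out <- do.call(rbind, pieces)
  out <- out[order(out$S, out$T), ]
  row.names(out) <- NULL
  out
}

set.seed(444)
# simplify = FALSE keeps the result as a list of data frames
df_list <- replicate(n_reps, resample_once(df, quads), simplify = FALSE)

# Optionally stack everything into one long data frame with a replicate id
all_df <- do.call(rbind, Map(function(d, i) cbind(rep = i, d),
                             df_list, seq_along(df_list)))
```

Keeping the replicates in a list suits per-dataset analysis with lapply, while the stacked all_df is convenient for grouped summaries by rep.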
