Mr K
Mr K

Reputation: 446

bootstrap a dataset in R

I need to perform bootstrapping for a dataset in R. The data is in the form of a list which contains two matrices and has following properties:

My goal is to estimate the parameters which lead to underlying probabilities. My model assumes 2n parameters, n for the number of rows and one set for each matrix. The parameters combine in complex manner and so the two matrices must be analyzed together.

    parameters <- data.frame(a=numeric(n), b=numeric(n)) 

Right now, this is the approach I am using:

  1. I define a function SGen which takes for input a matrix containing probabilities associated with all sites, generates a dataset using these probabilities and returns it.

    SGen <- function(freq) {
       #generate sample
       ...
    }
    
  2. For non-parametric bootstrap (which is what I want to implement for now), I run an experiment, and calculate the observed probability associated with each ij element by dividing the observed matrices with 10000. Let us call it freq for now. So, freq is a list with two matrices.

    freq <- list(a=data$a/10000, b=data$b/10000)
    
  3. Next, I replicate 100 samples for data by passing freq to SGen.
  4. I pass all the replicates to another pre-defined function, analyze which gives me 100 n by 2 matrices containing the parameters.
  5. Next I calculate the mean and sd of the entries between all the matrices to get an n by 2 matrix containing means and another containing sd. So, the desired value for say (1,5)th element of mean matrix is the mean of (1,5)th elements of all 100 replicates.

While the approach works, I would like to use boot package in R for the job. I want to do that because then I can use all the functions in boot package for later analyses and I also like the way essential information is stored in format of boot class. Another important reason I would like to use boot package is that it offers an easy way to make use of multicore capabilities of my computer. So, can you please guide me on how to use boot for my purpose?

Upvotes: 2

Views: 1339

Answers (1)

ajerneck
ajerneck

Reputation: 751

You can use bootstrap with almost any function in the following way (taken from ?bootstrap):

# To bootstrap functions of more complex data structures, 
# write theta so that its argument x
# is the set of observation numbers  
# and simply  pass as data to bootstrap the vector 1,2,..n. 
# For example, to bootstrap
# the correlation coefficient from a set of 15 data pairs:
xdata <- matrix(rnorm(30),ncol=2)
n <- 15
theta <- function(x,xdata){ cor(xdata[x,1],xdata[x,2]) }
results <- bootstrap(1:n,20,theta,xdata)

theta is the function to bootstrap.

The only problem with this approach (I believe) is that theta can only return a vector (not a dataframe/matrix of multiple values in one go). So, if your theta function returns anything else than a vector it might not work.

Update for the boot package:

The approach would be similar using the boot function from the boot package. It takes data, the data as a vector, matrix, or dataframe, and statistic, "a function which when applied to data returns a vector containing the statistic(s) of interest." For non-parametric bootstrap, the statistic function must take (at least) two arguments: the original data, and a vector of indices, frequencies or weights.

So, the key is to write one function that implements steps 1-5 on the subset of the data given by the index, e.g:

theta <- function(data, indices) {
    ## exact subsetting operation depends on the format of data
    subset_data = data[indices,]
    ## perform the calculations in steps 1-5 here on subset_data
}

Then you should be able to call theta like this:

boot(data, theta)

Upvotes: 0

Related Questions