Reputation: 446
I need to perform bootstrapping for a dataset in R. The data is in the form of a list which contains two matrices and has following properties:
Both the matrices are n by m and contain only positive integers (including 0).
data <- list(a=matrix(,n,m), b=matrix(,n,m))
A number of marbles, say 10000 are distributed into each matrix, i.e., 10000 is divided in n*m parts. In other words, sum of all entries for each matrix is fixed.
> sum(data$a)
[1] 10000
> sum(data$b)
[1] 10000
My goal is to estimate the parameters which lead to underlying probabilities. My model assumes 2n
parameters, n
for the number of rows and one set for each matrix. The parameters combine in complex manner and so the two matrices must be analyzed together.
parameters <- data.frame(a=numeric(n), b=numeric(n))
Right now, this is the approach I am using:
I define a function SGen
which takes for input a matrix containing probabilities associated with all sites, generates a dataset using these probabilities and returns it.
SGen <- function(freq) {
#generate sample
...
}
For non-parametric bootstrap (which is what I want to implement for now), I run an experiment, and calculate the observed probability associated with each ij element by dividing the observed matrices with 10000. Let us call it freq
for now. So, freq
is a list with two matrices.
freq <- list(a=data$a/10000, b=data$b/10000)
freq
to SGen
.analyze
which gives me 100 n by 2 matrices containing the parameters.While the approach works, I would like to use boot package in R for the job. I want to do that because then I can use all the functions in boot package for later analyses and I also like the way essential information is stored in format of boot class. Another important reason I would like to use boot package is that it offers an easy way to make use of multicore capabilities of my computer. So, can you please guide me on how to use boot
for my purpose?
Upvotes: 2
Views: 1339
Reputation: 751
You can use bootstrap
with almost any function in the following way (taken from ?bootstrap
):
# To bootstrap functions of more complex data structures,
# write theta so that its argument x
# is the set of observation numbers
# and simply pass as data to bootstrap the vector 1,2,..n.
# For example, to bootstrap
# the correlation coefficient from a set of 15 data pairs:
xdata <- matrix(rnorm(30),ncol=2)
n <- 15
theta <- function(x,xdata){ cor(xdata[x,1],xdata[x,2]) }
results <- bootstrap(1:n,20,theta,xdata)
theta
is the function to bootstrap.
The only problem with this approach (I believe) is that theta can only return a vector (not a dataframe/matrix of multiple values in one go). So, if your theta
function returns anything else than a vector it might not work.
boot
package:The approach would be similar using the boot
function from the boot
package. It takes data
, the data as a vector, matrix, or dataframe, and statistic
, "a function which when applied to data returns a vector containing the statistic(s) of interest." For non-parametric bootstrap, the statistic
function must take (at least) two arguments: the original data, and a vector of indices, frequencies or weights.
So, the key is to write one function that implements steps 1-5 on the subset of the data given by the index, e.g:
theta <- function(data, indices) {
## exact subsetting operation depends on the format of data
subset_data = data[indices,]
## perform the calculations in steps 1-5 here on subset_data
}
Then you should be able to call theta
like this:
boot(data, theta)
Upvotes: 0