Bootstrapping in R (with replacement) whilst retaining dependency for two variables

Question

Background and context

I am new to R, but I have a basic understanding of how to run a bootstrap procedure for a single variable. However, the online guides I have looked at only use examples with one variable, and their outcome is a histogram showing the frequency of the means generated across all the resamples.

I am looking to bootstrap a sample in which each observation consists of two variables (participant age and test score). I understand how to bootstrap each variable independently, i.e. age or score on its own, but because participants of the same age sometimes get different scores, I am not sure how to determine which score corresponds to the age that is resampled.

For example, a 20-year-old participant has a score of 50, and a second 20-year-old has a score of 70, and these are within my data. If I were to run a bootstrap with replacement based on age, it is possible that one of the 20-year-olds will be selected and replaced back into the dataset. However, I do not know what their corresponding score would be - i.e., I do not know whether the one who scored 50 or the one who scored 70 was selected.

Others I have asked suggested that I might need to extract age and score together, as a single row, to retain the dependency between the two. My data in R has one row per participant, with age in one column and score in another.

What am I looking for?

The end goal of the bootstrapping is to resample (with replacement) my data 200 times to give me 200 "different" sets of data, which I can put into a quadratic function to determine the vertex of the graph. These 200 values will be combined to generate a mean and standard error.

Having little experience with R coding, I have not tried a great deal other than understanding the basics of bootstrapping (with replacement).

I am aware that it is possible to mutate/merge data, but I do not believe it fits with this. I am not sure of how to proceed, and any support (sources of information or where I can look etc.) would be greatly appreciated.

Gerald T · Accepted Answer

You could run the resampling on the row indices, so that each sampled row keeps its age and score together.

For example:

set.seed(1)
df <- data.frame(age = rep(seq(20, 50, 10), each = 2), score = sample(50:70, 8))

  age score
1  20    68
2  20    62
3  30    53
4  30    67
5  40    66
6  40    55
7  50    51
8  50    65

Resample:

df[sample(seq_len(nrow(df)), nrow(df), replace = TRUE), ]

    age score
6    40    55
4    30    67
1    20    68
7    50    51
1.1  20    68
5    40    66
1.2  20    68
3    30    53
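
To get from a single resample to the 200 replicates described in the question, the same row-index idea can be wrapped in a loop. The following is only a sketch under stated assumptions: the data frame `dat` is simulated for illustration (it is not the asker's data), the columns are assumed to be named `age` and `score`, the helper `vertex_age` is hypothetical, and the "vertex" is taken to be the x-coordinate -b1 / (2 * b2) of a quadratic fitted with `lm`.

```r
# Hypothetical data for illustration: one row per participant.
# The column names `age` and `score` are assumptions.
set.seed(42)
n <- 100
dat <- data.frame(age = sample(20:60, n, replace = TRUE))
dat$score <- 80 - 0.05 * (dat$age - 40)^2 + rnorm(n, sd = 3)

# Hypothetical helper: fit score = b0 + b1*age + b2*age^2 and
# return the age at the vertex, -b1 / (2 * b2).
vertex_age <- function(d) {
  fit <- lm(score ~ age + I(age^2), data = d)
  b <- coef(fit)
  unname(-b[2] / (2 * b[3]))
}

# Resample whole rows (so age and score stay paired) 200 times
# and compute the vertex for each resampled data set.
n_boot <- 200
boot_vertices <- replicate(n_boot, {
  idx <- sample(seq_len(nrow(dat)), nrow(dat), replace = TRUE)
  vertex_age(dat[idx, ])
})

mean(boot_vertices)  # bootstrap mean of the vertex
sd(boot_vertices)    # bootstrap standard error of the vertex
```

The standard deviation of the 200 bootstrap estimates is the usual bootstrap estimate of the standard error. Note that with very few distinct ages (as in the small 8-row example above), a resample can contain fewer than three distinct ages, in which case the quadratic term cannot be estimated and `coef` returns `NA` for it.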
