Reputation: 93
Hi I matched a sample of individuals from the treatment group and the control group by using the genetic matching approach from the MatchIt
package. The outcome variables is the average spending of each month. Now I'm trying to generate a line graph to demonstrate patterns of average monthly spending of the treatment and control groups. I would like to add shaded areas to show the bootstrap confidence interval for these lines. The main issue that I encountered include:
1. Bootstrap in pairs
Bootstrap resampling for matched sample should be conducted based on each matched pair, not each individual. Thankfully I've got the solution from this post. The code will be provided in the example.
2. Plotting bootstrap confidence interval
I'm not sure how I can plot the bootstrap confidence intervals on the graphs.
Here I provide an example:
library(dplyr)
library(tidyr)
library(ggplot2)
ID <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L")
Pair <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6)
Treatment <- c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0)
Month_1 <- c(300, 150, 450, 100, 200, 300, 400, 600, 650, 150, 200, 400)
Month_2 <- c(400, 600, 650, 150, 200, 400, 500, 250, 700, 200, 300, 500)
Month_3 <- c(500, 250, 700, 200, 300, 500, 500, 250, 700, 200, 300, 500)
Month_4 <- c(600, 700, 650, 250, 500, 550, 600, 700, 650, 250, 500, 550)
Month_5 <- c(700, 200, 800, 300, 900, 800, 600, 700, 650, 250, 500, 550)
df <- data.frame(ID, Pair, Treatment, Month_1, Month_2, Month_3, Month_4, Month_5)
> df
ID Pair Treatment Month_1 Month_2 Month_3 Month_4 Month_5
1 A 1 1 300 400 500 600 700
2 B 1 0 150 600 250 700 200
3 C 2 1 450 650 700 650 800
4 D 2 0 100 150 200 250 300
5 E 3 1 200 200 300 500 900
6 F 3 0 300 400 500 550 800
7 G 4 1 400 500 500 600 600
8 H 4 0 600 250 250 700 700
9 I 5 1 650 700 700 650 650
10 J 5 0 150 200 200 250 250
11 K 6 1 200 300 300 500 500
12 L 6 0 400 500 500 550 550
In this dataset, "Month_1-5" denote the monthly spending of each individual; "Treatment" denotes if the individual is in the treatment group (coded as 1) or control group (coded as 0); and "Pair" indicates each matched pair. For example, individual A and B is in the sample pair, because they share the same Pair number of "1". Therefore, after bootstrap resampling, if A appears in any sample, B should appear as well.
Here I provide a code to conduct bootstrap resampling and to calculate mean spending for each month across treatment and control groups:
### 95% paired bootstrap confidence interval
library(boot)
df <- df %>%
mutate(Pair = as.factor(Pair))
pair_ids <- levels(df$Pair)
# For mean of Month_1 in control group
est_control <- function(pairs, i) {
# Compute number of times each pair is present
numreps <- table(pairs[i])
# For each pair p, copy corresponding data indices numreps[p] times
ids <- unlist(lapply(pair_ids[pair_ids %in% names(numreps)],
function(p) rep(which(df$Pair == p),
numreps[p])))
# Subset df with paired bootstrapped ids
md_boot <- df[ids,]
# Estimation
boot_control <- mean(md_boot[md_boot$Treatment == 0, "Month_1"])
# Return the mean
return(boot_control)
}
set.seed(1234)
boot_est <- boot(pair_ids, est_fun, R = 10000)
boot.ci(boot.out = boot_est, type = "bca")
# For mean of Month_1 in treatment group
est_treat <- function(pairs, i) {
# Compute number of times each pair is present
numreps <- table(pairs[i])
# For each pair p, copy corresponding data indices numreps[p] times
ids <- unlist(lapply(pair_ids[pair_ids %in% names(numreps)],
function(p) rep(which(df$Pair == p),
numreps[p])))
# Subset df with paired bootstrapped ids
md_boot <- df[ids,]
# Estimation
boot_treat <- mean(md_boot[md_boot$Treatment == 1, "Month_1"])
# Return the mean
return(boot_treat)
}
set.seed(1234)
boot_est <- boot(pair_ids, est_treat, R = 10000)
boot.ci(boot.out = boot_est, type = "bca")
In this case, the bootstrap confidence interval of "Month_1" for control group is [158.3, 441.7], and that for treatment group is [250, 500].
Then I calculate the average monthly spending for the control and treatment groups and then generate the line graph:
Monthly_spending <- df %>%
group_by(Treatment) %>%
summarise("1" = mean(Month_1, na.rm = T),
"2" = mean(Month_2, na.rm = T),
"3" = mean(Month_3, na.rm = T),
"4" = mean(Month_4, na.rm = T),
"5" = mean(Month_1, na.rm = T)) %>%
pivot_longer(c("1":"5"), names_to = "Month", values_to = "Spending")
Monthly_spending %>%
mutate(Treatment = as.factor(Treatment)) %>%
ggplot(aes(x = Month, y = Spending, group = Treatment, color = Treatment)) +
geom_point() +
geom_line(aes(group = Treatment)) +
ggtitle("Monthly Spending") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_colour_discrete("Groups", labels = c("Control", "Treatment")) +
theme(legend.position = "bottom")
However, I don't know how to add shaded areas around these two lines to indicate the bootstrap confidence intervals from the code that I provided above. Alternatively, I could run the code for each time point again and again, and then document that in the dataset manually to plot the intervals on the graphs. However, I wonder if there's a way to conduct it more efficiently. I would be really grateful for your help.
Upvotes: 0
Views: 565
Reputation: 173813
This seems like a very complex way to achieve your end goal. The ggplot2 function mean-cl_boot
takes an input vector and returns y, ymin and ymax values calculated by a fast bootstrap method. It works directly inside summarize
too, so we can replace your whole code with:
df %>%
pivot_longer(starts_with("Month"), names_to = "Month") %>%
group_by(Month, Treatment) %>%
summarize(mean_cl_boot(value)) %>%
mutate(Month = as.numeric(sub("Month_", "", Month)),
Treatment = c("Control", "Treatment")[Treatment + 1]) %>%
ggplot(aes(Month, y, color = Treatment)) +
geom_ribbon(aes(fill = Treatment, ymin = ymin, ymax = ymax),
alpha = 0.1, color = NA) +
geom_line(size = 1) +
geom_point(size = 3) +
scale_fill_brewer(palette = "Set1") +
scale_color_brewer(palette = "Set1") +
labs(y = "Monthly spending") +
theme_minimal(base_size = 16) +
theme(legend.position = "bottom")
Upvotes: 0