Reputation: 57
I am trying to pull and find the mean of Yi0 data if V1 = 0, and the mean of Yi1 data if V1=1, for each Vx (V1...V2 etc) column, but my code seems to flawed, does anyone has suggestions on how I could fix this?
This is my sample data and code
set.seed(1)
df <- data.frame(Yi0 = runif(n=10, min = 0, max = 10),
Yi1 = runif(n=10, min = 0, max = 10),
V1 = c(1,1,1,1,1,0,0,0,0,0),
V2 = c(0,1,0,1,0,1,0,1,0,1))
pm <- function(x) {
as.numeric(
ifelse(
test = df[,x] == 0,
yes =mean(df["Yi0"]),
no = ifelse(
test = df[,x] == 1,
yes = mean(df["Yi1"]),
no = "error")))
}
ab <- sapply(X = 3:4, FUN=pm)
ab
Ultimately, I will want to take mean(2.059 + 1.765 + ... + 7.69) - mean(8.9838+9.446+...+0.6178), for each of the column starting with V, I have a total of 2 million of them...
thank you
Upvotes: 0
Views: 57
Reputation: 72683
Unless somebody comes up with a data.table
solution, vapply()
combined with pm()
should be fastest.
pm <- function(x) mean(DF$Yi0[x == 0]) - mean(DF$Yi1[x == 1])
vapply(X=DF[grep('^V', names(DF))], FUN=pm, FUN.VALUE=0)
# V1 V2
# 1.9425275 -0.4277555
Benchmark:
n <- 1e5 - 2
set.seed(1)
v <- c(1,1,1,1,1,0,0,0,0,0)
DF <- cbind(DF, `colnames<-`(replicate(n, sample(v)), paste0('V', seq(n) + 2)))
microbenchmark::microbenchmark(
sapply=sapply(DF[-(1:2)], function(x) mean(DF$Yi0[x == 0]) - mean(DF$Yi1[x == 1])),
vapply=vapply(DF[grep('^V', names(DF))], pm, 0),
dapply=collapse::dapply(get_vars(DF, vars = "^V\\d+", regex = TRUE), MARGIN = 2,
FUN = function(x) fmean(DF$Yi0[!x] - fmean(DF$Yi1[x == 1]))),
times=3L,
control=list(warmup=10L))
# Unit: seconds
# expr min lq mean median uq max neval cld
# sapply 1.508540 1.552917 1.568816 1.597293 1.598954 1.600615 3 a
# vapply 1.432234 1.450200 1.502311 1.468165 1.537350 1.606534 3 a
# dapply 2.250366 2.257978 2.308068 2.265590 2.336918 2.408247 3 b
Upvotes: 1
Reputation: 887048
Based on the updated post, we can reshape to 'long' with pivot_longer
and take the difference in mean
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = V1:V2) %>%
group_by(name) %>%
summarise(Diff = mean(Yi0[value == 0]) - mean(Yi1[value == 1]))
-output
# A tibble: 2 × 2
name Diff
<chr> <dbl>
1 V1 1.94
2 V2 -0.428
If there are many columns, an efficient approach would be to use collapse
library(collapse)
dapply(get_vars(df, vars = "^V\\d+", regex = TRUE), MARGIN = 2,
FUN = function(x) fmean(df$Yi0[!x] - fmean(df$Yi1[x == 1])))
V1 V2
1.9425275 -0.4277555
Upvotes: 1
Reputation: 388907
If I understand you correctly, for one column (V1
) you want to do
mean(df$Yi0[df$V1 == 0]) - mean(df$Yi1[df$V1 == 1])
#[1] 1.9
For multiple columns using sapply
you can do -
sapply(df[3:4], function(x) mean(df$Yi0[x == 0]) - mean(df$Yi1[x == 1]))
# V1 V2
# 1.94 -0.43
Upvotes: 1