ppotatomato
ppotatomato

Reputation: 57

How can I modify Sapply to find the mean in other columns

I am trying to pull and find the mean of Yi0 data if V1 = 0, and the mean of Yi1 data if V1=1, for each Vx (V1...V2 etc) column, but my code seems to flawed, does anyone has suggestions on how I could fix this?

This is my sample data and code

set.seed(1)
    df <- data.frame(Yi0 =  runif(n=10, min = 0, max = 10),
                 Yi1 = runif(n=10, min = 0, max = 10),
                 V1 = c(1,1,1,1,1,0,0,0,0,0),
                 V2 = c(0,1,0,1,0,1,0,1,0,1))

enter image description here

pm <- function(x) {
  as.numeric(
    ifelse(
      test = df[,x] == 0,
      yes =mean(df["Yi0"]),
      no = ifelse(
        test = df[,x] == 1,
        yes = mean(df["Yi1"]),
        no = "error")))
} 

ab <- sapply(X = 3:4, FUN=pm)
ab

Ultimately, I will want to take mean(2.059 + 1.765 + ... + 7.69) - mean(8.9838+9.446+...+0.6178), for each of the column starting with V, I have a total of 2 million of them...

thank you

this is the original data set enter image description here

Upvotes: 0

Views: 57

Answers (3)

jay.sf
jay.sf

Reputation: 72683

Unless somebody comes up with a data.table solution, vapply() combined with pm() should be fastest.

pm <- function(x) mean(DF$Yi0[x == 0]) - mean(DF$Yi1[x == 1])
vapply(X=DF[grep('^V', names(DF))], FUN=pm, FUN.VALUE=0)
#        V1         V2 
# 1.9425275 -0.4277555 

Benchmark:

n <- 1e5 - 2
set.seed(1)
v <- c(1,1,1,1,1,0,0,0,0,0)
DF <- cbind(DF, `colnames<-`(replicate(n, sample(v)), paste0('V', seq(n) + 2)))
microbenchmark::microbenchmark(
  sapply=sapply(DF[-(1:2)], function(x) mean(DF$Yi0[x == 0]) - mean(DF$Yi1[x == 1])),
  vapply=vapply(DF[grep('^V', names(DF))], pm, 0),
  dapply=collapse::dapply(get_vars(DF, vars = "^V\\d+", regex = TRUE), MARGIN = 2,
         FUN = function(x) fmean(DF$Yi0[!x] - fmean(DF$Yi1[x == 1]))),
  times=3L,
  control=list(warmup=10L))
# Unit: seconds
#   expr      min       lq     mean   median       uq      max neval cld
# sapply 1.508540 1.552917 1.568816 1.597293 1.598954 1.600615     3  a 
# vapply 1.432234 1.450200 1.502311 1.468165 1.537350 1.606534     3  a 
# dapply 2.250366 2.257978 2.308068 2.265590 2.336918 2.408247     3   b

Upvotes: 1

akrun
akrun

Reputation: 887048

Based on the updated post, we can reshape to 'long' with pivot_longer and take the difference in mean

library(dplyr)
library(tidyr)
df %>% 
    pivot_longer(cols = V1:V2) %>% 
    group_by(name) %>% 
    summarise(Diff = mean(Yi0[value == 0]) - mean(Yi1[value == 1]))

-output

# A tibble: 2 × 2
  name    Diff
  <chr>  <dbl>
1 V1     1.94 
2 V2    -0.428

If there are many columns, an efficient approach would be to use collapse

library(collapse)
dapply(get_vars(df, vars = "^V\\d+", regex = TRUE), MARGIN = 2,
    FUN = function(x) fmean(df$Yi0[!x] - fmean(df$Yi1[x == 1])))
        V1         V2 
 1.9425275 -0.4277555 

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 388907

If I understand you correctly, for one column (V1) you want to do

mean(df$Yi0[df$V1 == 0]) - mean(df$Yi1[df$V1 == 1])
#[1] 1.9

For multiple columns using sapply you can do -

sapply(df[3:4], function(x) mean(df$Yi0[x == 0]) - mean(df$Yi1[x == 1]))

#   V1    V2 
# 1.94 -0.43 

Upvotes: 1

Related Questions