More RAM efficient t.test and data wrangling in R

Question

My reprex works fine, but with very large datasets the step that reshapes the data frame from wide to long takes a lot of time and, in real life, pushes RAM usage beyond my 32 GB limit.
In addition, I have to run the t-test function twice, once per treatment variant.

I am looking for a way to get the same output in a more RAM- and CPU-efficient way. Any suggestions? I would prefer tidyverse solutions, but data.table would also be fine if it helps reduce the load.

library(tidyverse)
library(stringi)
library(broom)

set.seed(101)
## creating dataset:
## 15 mln users, split evenly across three variants (off, a, b)
## two variables / measures
w <- tibble(
    id=stri_rand_strings(15000000, 10),
    variant=rep(c("off", "a", "b"), each=5000000),
    variable_a=c(rnorm(n=5000000, mean = 2, sd=1),rnorm(n=5000000, mean = 3, sd=1), rnorm(n=5000000, mean = 3, sd=2)),
    variable_b=c(rnorm(n=5000000, mean = 10, sd=2),rnorm(n=5000000, mean = 10, sd=2), rnorm(n=5000000, mean = 10, sd=5))
)

## creating the long data format
## costs RAM (+ 50 %) and time
## Q: is there a way to improve this?
w <- w%>%
    gather(variable, values, 3:4)


## creating a t.test function that runs on long data format
p_values <- function(data, control="off", treatment="on"){
    data%>% 
        ## grouping by variable allows to run t.test for each variable
        group_by(variable)%>%
        do(tidy(with(data = ., t.test(values[variant == control], values[variant == treatment]))))%>%
        select(variable, p.value)%>%
        mutate(p.value=round(p.value,3))%>%
        mutate(variant = treatment)
}

## running the function
## Q: is there a way to improve this?
p_a <- p_values(w, control = "off", treatment = "a")
p_b <- p_values(w, control = "off", treatment = "b")
p <- rbind(p_a, p_b)


## displaying the results and adding the p values
w %>%
    group_by(variant, variable)%>%
    summarise(avg=mean(values, na.rm=TRUE))%>%
    group_by(variable)%>%
    mutate(lift=round((avg/avg[variant=="off"]-1)*100,3))%>%
    left_join(p, by = c("variant", "variable"))%>%
    pivot_wider(names_from = variant, values_from = c(avg, lift, p.value))%>%
    select(-c(lift_off, p.value_off))%>%
    relocate(variable, ends_with(c("off","a", "b")))
#> `summarise()` has grouped output by 'variant'. You can override using the
#> `.groups` argument.
#> # A tibble: 2 × 8
#> # Groups:   variable [2]
#>   variable   avg_off avg_a lift_a p.value_a avg_b lift_b p.value_b
#>   <chr>        <dbl> <dbl>  <dbl>     <dbl> <dbl>  <dbl>     <dbl>
#> 1 variable_a    2.00  3.00 50.1       0      3.00 50.2       0    
#> 2 variable_b   10.0  10.0  -0.024     0.053 10.0  -0.012     0.624

Created on 2022-08-31 by the reprex package (v2.0.1)

Robert Hacken · Accepted Answer

If the long format is the problem, I would just work with the data in the wide format you already have.

Here I've rewritten your p_values function to work with your initial data format before calling gather:

p_values <- function(data, control="off", treatment="on") {
  
  ## one t.test per column whose name starts with "variable"
  p.val <- sapply(grep("^variable", names(data), value=TRUE), function(var) {
    t.test(data[[var]][data$variant==control],
           data[[var]][data$variant==treatment])$p.value
  })
  
  ## same output columns as the original long-format version
  tibble(variable=names(p.val), p.value=round(p.val, 3), variant=treatment)
}
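
To show how the wide-format function could slot into the rest of the pipeline, here is a minimal sketch. It is not part of the original answer and rests on a few assumptions: the data is a small stand-in (n = 1000 rows per variant instead of 5,000,000), p_values is repeated so the block runs on its own, and the result name res is only illustrative. The idea is to compute the per-variant means on the wide data and reshape only that tiny summary, so the full dataset never has to be converted to long format.

```r
library(dplyr)
library(tidyr)
library(tibble)

set.seed(101)
n <- 1000  # assumption: small stand-in for the 5,000,000 rows per variant
w <- tibble(
  variant    = rep(c("off", "a", "b"), each = n),
  variable_a = c(rnorm(n, 2, 1), rnorm(n, 3, 1), rnorm(n, 3, 2)),
  variable_b = c(rnorm(n, 10, 2), rnorm(n, 10, 2), rnorm(n, 10, 5))
)

## p_values() from the answer, repeated here so the sketch is self-contained
p_values <- function(data, control = "off", treatment = "on") {
  p.val <- sapply(grep("^variable", names(data), value = TRUE), function(var) {
    t.test(data[[var]][data$variant == control],
           data[[var]][data$variant == treatment])$p.value
  })
  tibble(variable = names(p.val), p.value = round(p.val, 3), variant = treatment)
}

## one call per treatment, both on the wide data
p <- bind_rows(p_values(w, treatment = "a"),
               p_values(w, treatment = "b"))

## means are computed on the wide data; only the 3-row summary is reshaped
res <- w %>%
  group_by(variant) %>%
  summarise(across(starts_with("variable"), ~ mean(.x, na.rm = TRUE))) %>%
  pivot_longer(-variant, names_to = "variable", values_to = "avg") %>%
  group_by(variable) %>%
  mutate(lift = round((avg / avg[variant == "off"] - 1) * 100, 3)) %>%
  ungroup() %>%
  left_join(p, by = c("variant", "variable")) %>%
  pivot_wider(names_from = variant, values_from = c(avg, lift, p.value)) %>%
  select(-c(lift_off, p.value_off)) %>%
  relocate(variable, ends_with(c("off", "a", "b")))

res
```

The result should have the same columns as the table in the question (one row per variable, with avg, lift and p.value per variant). The expensive gather step disappears because pivot_longer only ever sees one row per variant.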
