Reputation: 995
My reprex works fine, but on very large datasets the step that converts the data frame from wide to long takes a lot of time and, in real life, pushes my RAM beyond its 32 GB limit.
In addition, I run the t-test function twice, once per treatment group.
I am looking for a way to get the same output in a more RAM- and CPU-efficient way. Any suggestions? I would prefer tidyverse solutions, but data.table would also be fine if it helps reduce the load.
library(tidyverse)
library(stringi)
library(broom)
set.seed(101)
## creating dataset:
## 15 mln users: 5 mln each in the control group ("off") and two treatment groups ("a", "b")
## two variables / measures
w <- tibble(
  id=stri_rand_strings(15000000, 10),
  variant=rep(c("off", "a", "b"), each=5000000),
  variable_a=c(rnorm(n=5000000, mean = 2, sd=1), rnorm(n=5000000, mean = 3, sd=1), rnorm(n=5000000, mean = 3, sd=2)),
  variable_b=c(rnorm(n=5000000, mean = 10, sd=2), rnorm(n=5000000, mean = 10, sd=2), rnorm(n=5000000, mean = 10, sd=5))
)
## creating the long data format
## costs RAM (+ 50 %) and time
## Q: is there a way to improve this?
w <- w %>%
  gather(variable, values, 3:4)
## creating a t.test function that runs on long data format
p_values <- function(data, control="off", treatment="on"){
  data %>%
    ## grouping by variable allows to run t.test for each variable
    group_by(variable) %>%
    do(tidy(with(data = ., t.test(values[variant == control], values[variant == treatment])))) %>%
    select(variable, p.value) %>%
    mutate(p.value=round(p.value,3)) %>%
    mutate(variant = treatment)
}
## running the function
## Q: is there a way to improve this?
p_a <- p_values(w, control = "off", treatment = "a")
p_b <- p_values(w, control = "off", treatment = "b")
p <- rbind(p_a, p_b)
## displaying the results and adding the p values
w %>%
  group_by(variant, variable) %>%
  summarise(avg=mean(values, na.rm=TRUE)) %>%
  group_by(variable) %>%
  mutate(lift=round((avg/avg[variant=="off"]-1)*100,3)) %>%
  left_join(p, by = c("variant", "variable")) %>%
  pivot_wider(names_from = variant, values_from = c(avg, lift, p.value)) %>%
  select(-c(lift_off, p.value_off)) %>%
  relocate(variable, ends_with(c("off","a", "b")))
#> `summarise()` has grouped output by 'variant'. You can override using the
#> `.groups` argument.
#> # A tibble: 2 × 8
#> # Groups: variable [2]
#> variable avg_off avg_a lift_a p.value_a avg_b lift_b p.value_b
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 variable_a 2.00 3.00 50.1 0 3.00 50.2 0
#> 2 variable_b 10.0 10.0 -0.024 0.053 10.0 -0.012 0.624
Created on 2022-08-31 by the reprex package (v2.0.1)
Upvotes: 0
Views: 50
Reputation: 4725
If the long format is the problem, then I would just work with the data in the wide format you already have.
Here I've rewritten your p_values function to work with your initial data format, before calling gather:
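p_values <- function(data, control="off", treatment="on") {
  ## run one t-test per measure column (names starting with "variable") on the wide data
  p.val <- sapply(grep("^variable", names(data), value=TRUE), function(var) {
    t.test(data[[var]][data$variant==control],
           data[[var]][data$variant==treatment])$p.value
  })
  ## same columns as the original function: variable, p.value, variant
  tibble(variable=names(p.val), p.value=round(p.val, 3), variant=treatment)
}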
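The summary table from the question can probably be built without the long copy as well. The sketch below is one way to do it, not necessarily the fastest: it aggregates the wide data first and reshapes only the tiny summary (one row per variant). `w_small` is a made-up, scaled-down stand-in for the wide `w` from the question, and the function is repeated only so that the snippet runs on its own.

```r
library(dplyr)
library(tidyr)

set.seed(101)
# Hypothetical small stand-in for the wide data `w` (before gather())
n <- 1000
w_small <- tibble(
  variant    = rep(c("off", "a", "b"), each = n),
  variable_a = c(rnorm(n, 2, 1), rnorm(n, 3, 1), rnorm(n, 3, 2)),
  variable_b = c(rnorm(n, 10, 2), rnorm(n, 10, 2), rnorm(n, 10, 5))
)

# Wide-format p_values(), as defined above
p_values <- function(data, control = "off", treatment = "on") {
  p.val <- sapply(grep("^variable", names(data), value = TRUE), function(var) {
    t.test(data[[var]][data$variant == control],
           data[[var]][data$variant == treatment])$p.value
  })
  tibble(variable = names(p.val), p.value = round(p.val, 3), variant = treatment)
}

p <- bind_rows(
  p_values(w_small, treatment = "a"),
  p_values(w_small, treatment = "b")
)

# Aggregate on the wide data, then reshape only the 3-row summary
res <- w_small %>%
  group_by(variant) %>%
  summarise(across(starts_with("variable"), ~ mean(.x, na.rm = TRUE))) %>%
  pivot_longer(-variant, names_to = "variable", values_to = "avg") %>%
  group_by(variable) %>%
  mutate(lift = round((avg / avg[variant == "off"] - 1) * 100, 3)) %>%
  ungroup() %>%
  left_join(p, by = c("variant", "variable")) %>%
  pivot_wider(names_from = variant, values_from = c(avg, lift, p.value)) %>%
  select(-c(lift_off, p.value_off)) %>%
  relocate(variable, ends_with(c("off", "a", "b")))

res
```

Because `pivot_longer()` only ever sees the per-variant means, the reshape cost no longer depends on the number of users.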
Upvotes: 1
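A further option, if calling t.test once per treatment is itself too costly: the default t.test (Welch, two-sided), as used in the question, depends only on each group's size, mean and variance. Those can be computed in a single grouped pass over the wide data, and the p-values derived from them afterwards. The sketch below uses only base R; `welch_p` is a hypothetical helper written for this illustration, not a function from any package, and it is checked here against t.test on small simulated data.

```r
# Welch's two-sample t-test from per-group summary statistics only.
# welch_p() is a hypothetical helper, not part of any package.
welch_p <- function(n1, m1, v1, n2, m2, v2) {
  se2 <- v1 / n1 + v2 / n2                  # squared standard error of the difference
  t   <- (m1 - m2) / sqrt(se2)              # Welch t statistic
  # Welch-Satterthwaite approximation of the degrees of freedom
  df  <- se2^2 / ((v1 / n1)^2 / (n1 - 1) + (v2 / n2)^2 / (n2 - 1))
  2 * pt(-abs(t), df)                       # two-sided p-value
}

set.seed(101)
x <- rnorm(500, mean = 10, sd = 2)   # stand-in for the control group
y <- rnorm(500, mean = 10, sd = 5)   # stand-in for a treatment group

p_summary <- welch_p(length(x), mean(x), var(x), length(y), mean(y), var(y))
p_ttest   <- t.test(x, y)$p.value

# Both routes should agree up to floating-point tolerance
all.equal(p_summary, p_ttest)
```

With this approach the control group's statistics are computed once and reused for every treatment, instead of subsetting the full vectors for each comparison.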