Reputation: 181
I have a very large data set in tibble form. I'd like to summarize the data using some functions which return lists. I'm interested in several components of the list, and I'd like to return each of the components I need into new tibble columns.
Here's an example:
library(tibble)
library(dplyr)
# Create a data set of 1,000 random values in 100 subgroups with sample size 10
contrived_data <- tibble(subgroup = rep(1:100, each = 10),
                         value = rnorm(1000, mean = 5, sd = 1))
# Run the KS test vs. normal distribution on each sample of size 10. Return the KS statistic and p-value
# into new tibble columns
contrived_data %>%
  group_by(subgroup) %>%
  summarize(avg = mean(value),
            std_dev = sd(value),
            ks_stat = ks.test(value, "pnorm", mean = 5, sd = 1)$statistic,
            ks_pval = ks.test(value, "pnorm", mean = 5, sd = 1)$p.value)
Running it this way gets the results I want, but not very efficiently. Calling the ks.test function twice per group means the execution time is (almost) doubled. There must be a more efficient way to extract both list components from a single function call, but I don't know how to do that.
Upvotes: 2
Views: 1819
Reputation: 28675
You can use group_modify, which runs the test once per group and returns a one-row tibble for each:
library(tidyverse)
contrived_data %>%
  group_by(subgroup) %>%
  group_modify(~{
    ks <- ks.test(.$value, "pnorm", mean = 5, sd = 1)
    tibble(
      avg = mean(.$value),
      std_dev = sd(.$value),
      ks_stat = ks$statistic,
      ks_pval = ks$p.value)
  })
Or with data.table:
library(data.table)
# setDT() converts contrived_data to a data.table by reference (no copy is made)
setDT(contrived_data)
contrived_data[, {
  ks <- ks.test(value, "pnorm", mean = 5, sd = 1)
  .(avg = mean(value),
    std_dev = sd(value),
    ks_stat = ks$statistic,
    ks_pval = ks$p.value)
}, by = subgroup]
Upvotes: 2
Reputation: 46888
You can define a function that summarizes one subgroup, then apply it to each subgroup with map_dfr from purrr:
library(tibble)
library(dplyr)
library(purrr)
# Summarize a single subgroup; ks.test is called only once
func <- function(DA) {
  kstest <- ks.test(DA$value, "pnorm", mean = 5, sd = 1)
  data.frame(
    subgroup = unique(DA$subgroup),
    avg = mean(DA$value),
    std_dev = sd(DA$value),
    ks_stat = kstest$statistic,
    ks_pval = kstest$p.value)
}
# Split by subgroup, apply func to each piece, and row-bind the results
contrived_data %>%
  split(.$subgroup) %>%
  map_dfr(func)
Upvotes: 4
Reputation: 2011
A dplyr solution that stores the test result in a list column and then uses rowwise, which performs the same task as map does, to extract the components:
contrived_data %>%
  group_by(subgroup) %>%
  summarise(
    avg = mean(value),
    std_dev = sd(value),
    ks_test = list(ks.test(value, "pnorm", mean = 5, sd = 1))
  ) %>%
  ungroup() %>%
  rowwise() %>%
  mutate(
    ks_stat = ks_test$statistic,
    ks_pval = ks_test$p.value
  ) %>%
  ungroup() %>%
  select(-ks_test)
# A tibble: 100 x 5
# subgroup avg std_dev ks_stat ks_pval
# <int> <dbl> <dbl> <dbl> <dbl>
# 1 1 5.10 1.24 0.186 0.819
# 2 2 4.86 0.805 0.231 0.584
# 3 3 5.24 0.729 0.258 0.445
# 4 4 5.16 0.642 0.307 0.247
# 5 5 4.63 0.752 0.393 0.0664
# Benchmark using rbenchmark:
# test replications elapsed relative user.self sys.self user.child sys.child
#2 nested 1000 10.58 1.000 10.58 0 NA NA
#1 original 1000 16.75 1.583 16.73 0 NA NA
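A comparison like the one above could be set up roughly as follows. This is only a sketch: it assumes the rbenchmark package is installed, the wrapper names original() and nested() are hypothetical labels chosen to match the output, and it uses far fewer replications than the 1000 reported, so timings will differ.

```r
library(tibble)
library(dplyr)
library(rbenchmark)

set.seed(42)  # arbitrary seed, only for reproducibility of this sketch
contrived_data <- tibble(subgroup = rep(1:100, each = 10),
                         value = rnorm(1000, mean = 5, sd = 1))

# Hypothetical wrapper: the approach from the question (ks.test called twice per group)
original <- function(d) {
  d %>%
    group_by(subgroup) %>%
    summarise(ks_stat = ks.test(value, "pnorm", mean = 5, sd = 1)$statistic,
              ks_pval = ks.test(value, "pnorm", mean = 5, sd = 1)$p.value)
}

# Hypothetical wrapper: the list-column approach (ks.test called once per group)
nested <- function(d) {
  d %>%
    group_by(subgroup) %>%
    summarise(ks_test = list(ks.test(value, "pnorm", mean = 5, sd = 1))) %>%
    rowwise() %>%
    mutate(ks_stat = ks_test$statistic,
           ks_pval = ks_test$p.value) %>%
    ungroup() %>%
    select(-ks_test)
}

res <- benchmark(original = original(contrived_data),
                 nested = nested(contrived_data),
                 replications = 10,
                 columns = c("test", "replications", "elapsed", "relative"))
print(res)
```

The relative column reports each elapsed time as a multiple of the fastest test, which is how the 1.583 figure above should be read.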
Upvotes: 3
Reputation: 886938
The test can be run once and wrapped in a list, and then map (from purrr) can be used to extract the values:
library(purrr)
library(dplyr)
library(tidyr)
contrived_data %>%
  group_by(subgroup) %>%
  summarize(avg = mean(value),
            std_dev = sd(value),
            test = list(ks.test(value, "pnorm", mean = 5, sd = 1))) %>%
  mutate(out = map(test, ~ tibble(ks_stat = .x$statistic,
                                  ks_pval = .x$p.value))) %>%
  unnest_wider(c(out)) %>%
  select(-test)
# A tibble: 100 x 5
# subgroup avg std_dev ks_stat ks_pval
# <int> <dbl> <dbl> <dbl> <dbl>
# 1 1 4.52 0.675 0.375 0.0907
# 2 2 5.17 1.02 0.342 0.152
# 3 3 5.02 0.909 0.141 0.972
# 4 4 5.08 0.846 0.313 0.227
# 5 5 4.82 0.819 0.225 0.614
# 6 6 5.07 0.866 0.159 0.928
# 7 7 4.94 0.914 0.145 0.966
# 8 8 5.52 1.01 0.290 0.306
# 9 9 5.17 0.787 0.258 0.443
#10 10 4.61 1.15 0.476 0.0132
# … with 90 more rows
Another option is to tidy the output (with broom) and extract all the components at once:
library(broom)
contrived_data %>%
  group_by(subgroup) %>%
  summarize(avg = mean(value),
            std_dev = sd(value),
            out = list(tidy(ks.test(value, "pnorm", mean = 5, sd = 1)))) %>%
  unnest_wider(c(out))
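As a further variant, and assuming dplyr 1.0.0 or later, summarize() also accepts an unnamed expression that returns a data frame and splices its columns into the result, so the list column and the unnest step can be skipped. The helper name ks_cols below is hypothetical, used purely for illustration.

```r
library(tibble)
library(dplyr)

set.seed(1)  # arbitrary seed, only for reproducibility of this sketch
contrived_data <- tibble(subgroup = rep(1:100, each = 10),
                         value = rnorm(1000, mean = 5, sd = 1))

# Hypothetical helper: run ks.test once and return both components as a one-row tibble
ks_cols <- function(x) {
  ks <- ks.test(x, "pnorm", mean = 5, sd = 1)
  tibble(ks_stat = unname(ks$statistic),  # drop the "D" name carried by the statistic
         ks_pval = ks$p.value)
}

result <- contrived_data %>%
  group_by(subgroup) %>%
  summarize(avg = mean(value),
            std_dev = sd(value),
            ks_cols(value))  # unnamed data-frame result becomes two columns

print(result)
```

Because ks_cols(value) is passed without a name, its ks_stat and ks_pval columns appear directly alongside avg and std_dev.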
Upvotes: 3