How to create confidence intervals for multiple columns using purrr and dplyr packages in R?

Question

I would love to have your inputs about how to use the purrr package for multiple columns. In specific, I want to do some basic operations to create the confidence interval (lower and upper) for each of the variables mass and birth_year by skin_color, from the starwars database.

What I have done so far is:

Calculate the number of observations different to NA foreach of the columns of my interest (mass and birth_year) by skin_color.

pacman::p_load("tidyr","purrr")

data <- starwars
data_obs <- 
  data %>%
  dplyr::select(mass,birth_year,skin_color) %>%
  dplyr::group_by(skin_color)%>%
  dplyr::summarise_all(funs(sum(!is.na(.))))%>%
  dplyr::ungroup()

I created a database that estimates the mean and standard deviation foreach variable of interest by skin_color.

data_stats <- 
  data %>%
  dplyr::select(mass,birth_year,skin_color)%>%
  dplyr::group_by(skin_color) %>%
  dplyr::summarise_all(., list(mean,sd)
                       , na.rm=T
                       )%>%
  dplyr::ungroup()

I merged both databases and in that way I have the number of observations different from NA, the mean, and the sd foreach of the columns.

data_complete <-
  dplyr::inner_join(data_obs,data_stats, by="skin_color")

From here, it is easy to estimate the standard error foreach variable manually by:

data_complete <-
  dplyr::mutate(mass_se = mass_sd/sqrt(mass_n), 
                mass_ci_upper = mass_mean + qt(1 - (0.05 / 2), mass_n - 1)*mass_se,
                mass_ci_lower = mass_mean - qt(1 - (0.05 / 2), mass_n - 1)*mass_se)

However, since this is a lot of work for my real dataset (with more than 50 variables), I would like to use the purrr package. I have tried by doing:

list_vectors <- list(data$mass,data$birth_year)
list_ready <- map(list_vectors, 
                  ~ data %>%
                    group_by(skin_color)%>%
                    dplyr::summarise_all(funs(sum(!is.na(.))))%>%
                    dplyr::summarise_all(., list(mean,sd), na.rm=T) %>%
                    dplyr::ungroup()%>%
                    dplyr::mutate(var_se=var_sd/sqrt(var_n))) 

vector_1 <- list_ready[[1]]

But this doesn't work. Any help is really really appreciated! (I really want to use the purrr package).

Andy Baxter · Accepted Answer

Simplest way might be to a) put your calculation steps into a function processing a vector and returning a list of a tibble with the values you need and b) passing this into across instead (using iris as an example instead):

library(tidyverse)

mean_ci <- function(vars) {
  vars <- vars[!is.na(vars)]
  mn <- mean(vars)
  se <- sd(vars) / sqrt(length(vars))
  tibble(
    mean = mn,
    lower = mn - qt(1 - (0.05 / 2), length(vars) - 1) * se,
    upper = mn + qt(1 - (0.05 / 2), length(vars) - 1) * se
  )
}

iris |>
  group_by(Species) |>
  summarise(across(where(is.numeric), mean_ci)) |> 
  unnest(where(is_tibble), names_sep = "_")

#> # A tibble: 3 × 13
#>   Species    Sepal.Len…¹ Sepal…² Sepal…³ Sepal…⁴ Sepal…⁵ Sepal…⁶ Petal…⁷ Petal…⁸
#>                                    
#> 1 setosa            5.01    4.91    5.11    3.43    3.32    3.54    1.46    1.41
#> 2 versicolor        5.94    5.79    6.08    2.77    2.68    2.86    4.26    4.13
#> 3 virginica         6.59    6.41    6.77    2.97    2.88    3.07    5.55    5.40
#> # … with 4 more variables: Petal.Length_upper , Petal.Width_mean ,
#> #   Petal.Width_lower , Petal.Width_upper , and abbreviated variable
#> #   names ¹Sepal.Length_mean, ²Sepal.Length_lower, ³Sepal.Length_upper,
#> #   ⁴Sepal.Width_mean, ⁵Sepal.Width_lower, ⁶Sepal.Width_upper,
#> #   ⁷Petal.Length_mean, ⁸Petal.Length_lower

A more purrr-y way to do it might be to map the function to nested dataframes to create a slightly longer-form data output:

iris |> 
  nest(data = -Species) |> 
  mutate(data = map(data, ~tibble(metric = names(.x), map_df(.x, mean_ci)))) |> 
  unnest(data) 

#> # A tibble: 12 × 5
#>    Species    metric        mean lower upper
#>                    
#>  1 setosa     Sepal.Length 5.01  4.91  5.11 
#>  2 setosa     Sepal.Width  3.43  3.32  3.54 
#>  3 setosa     Petal.Length 1.46  1.41  1.51 
#>  4 setosa     Petal.Width  0.246 0.216 0.276
#>  5 versicolor Sepal.Length 5.94  5.79  6.08 
#>  6 versicolor Sepal.Width  2.77  2.68  2.86 
#>  7 versicolor Petal.Length 4.26  4.13  4.39 
#>  8 versicolor Petal.Width  1.33  1.27  1.38 
#>  9 virginica  Sepal.Length 6.59  6.41  6.77 
#> 10 virginica  Sepal.Width  2.97  2.88  3.07 
#> 11 virginica  Petal.Length 5.55  5.40  5.71 
#> 12 virginica  Petal.Width  2.03  1.95  2.10

How to create confidence intervals for multiple columns using purrr and dplyr packages in R?

Answers (1)

Related Questions