Ekholme

Reputation: 393

Summarizing at multiple levels of hierarchy

I've got a dataset with four levels: observations (i.e. a period in which a teacher is observed), teachers, schools, and school divisions. Observations are nested within teachers, who are nested within schools, which are nested within divisions.

Each row in the data corresponds to one instance of a teacher being observed.

At each level of the hierarchy, I want to compute the mean, sd, min, and max for each of several variables (x1, x2, and x3 in the simulated data, but there are ~12 in the actual data). And I want all of these summaries in a single dataframe.

The code below will do it, but it feels clunky to me. More specifically, a few things bothering me are:

  1. I couldn't figure out how to rename within the function I wrote using the group_var value, so I resorted to manually doing this outside of the functions.
  2. I end up creating multiple dataframes and then using left_join to join them together at the end (again manually).
  3. Ultimately, I feel like there is probably a way (possibly using something in purrr) to "peel back" layers of hierarchy and aggregate, but it's eluding me.

Any advice on how to streamline this, and particularly how to pass the group_var values to rename_at, would be much appreciated!

library(tidyverse)
library(treemap)

df <- random.hierarchical.data(n = 200, depth = 4) %>%
  rename(div = index1,
         sch = index2,
         teacher = index3,
         obs = index4,
         x1 = x) %>%
  mutate(x2 = rlnorm(200),
         x3 = rlnorm(200))

sum_func <- function(data, sum_vars, ...) {
  group_vars <- enquos(...)

  data %>%
    group_by(!!!group_vars) %>%
    summarize_at(vars(sum_vars),
                 list(
                   ~mean(., na.rm = TRUE),
                   ~sd(., na.rm = TRUE),
                   ~min(., na.rm = TRUE),
                   ~max(., na.rm = TRUE)
                 )) %>%
    ungroup()
}

use_vars <- c("x1", "x2", "x3")

teacher_sum <- sum_func(data = df, sum_vars = use_vars, div, sch, teacher) %>%
  rename_at(vars(-c("teacher", "sch", "div")), ~str_replace_all(., "^", "teacher_"))

sch_sum <- sum_func(df, sum_vars = use_vars, div, sch) %>%
  rename_at(vars(-c("sch", "div")), ~str_replace_all(., "^", "sch_"))

div_sum <- sum_func(df, sum_vars = use_vars, div) %>%
  rename_at(vars(-c("div")), ~str_replace_all(., "^", "div_"))

full <- teacher_sum %>%
  left_join(sch_sum, by = c("sch", "div")) %>%
  left_join(div_sum, by = "div")

Upvotes: 1

Views: 366

Answers (1)

mnist

Reputation: 6954

You were quite close. The code below handles the renaming inside the function by passing the prefix as an extra argument, but I am not sure how to fully automate the joining, since the logic is not clear to me.

sum_func <- function(data, sum_vars, replacement, ...) {
  group_vars <- enquos(...)

  data %>%
    group_by(!!!group_vars) %>%
    summarize_at(vars(sum_vars),
                 list(
                   ~mean(., na.rm = TRUE),
                   ~sd(., na.rm = TRUE),
                   ~min(., na.rm = TRUE),
                   ~max(., na.rm = TRUE)
                 )) %>%
    ungroup() %>%
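    # Prefix every summary column; the grouping columns are spliced in
    # with !!! so that they are excluded from the rename.
    rename_at(vars(-c(!!!group_vars)),
              ~str_replace_all(., "^", replacement))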
}

use_vars <- c("x1", "x2", "x3")

teacher_sum <- sum_func(data = df, 
                        sum_vars = use_vars, 
                        replacement = "teacher_",
                        div, sch, teacher)

sch_sum <- sum_func(data = df, 
                    sum_vars = use_vars, 
                    replacement = "sch_",
                    div, sch)

div_sum <- sum_func(data = df, 
                    sum_vars = use_vars, 
                    replacement = "div_",
                    div)
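
If the joining logic is simply "join each coarser summary onto the finer one by whatever grouping columns they share", then one possible way to automate it is with purrr's map and reduce. This is only a sketch under some assumptions: it needs dplyr >= 1.0 (for across), it uses a small made-up data frame instead of the treemap one, and the names sum_level and levels_outer_to_inner are mine, not from the question.

```r
library(dplyr)
library(purrr)

# Small stand-in for the question's data (hypothetical values).
set.seed(1)
df <- tibble(
  div     = rep(c("A", "B"), each = 8),
  sch     = rep(c("A1", "A2", "B1", "B2"), each = 4),
  teacher = rep(paste0("t", 1:8), each = 2),
  x1 = rnorm(16), x2 = rlnorm(16), x3 = rlnorm(16)
)

# Hierarchy columns, outermost first (assumed ordering).
levels_outer_to_inner <- c("div", "sch", "teacher")
use_vars <- c("x1", "x2", "x3")

# Summarise at one level: group by the first `depth` hierarchy columns and
# prefix each summary column with the innermost grouping column's name.
sum_level <- function(data, depth) {
  grp <- levels_outer_to_inner[seq_len(depth)]
  prefix <- grp[depth]
  data %>%
    group_by(across(all_of(grp))) %>%
    summarise(
      across(all_of(use_vars),
             list(mean = ~mean(.x, na.rm = TRUE),
                  sd   = ~sd(.x, na.rm = TRUE),
                  min  = ~min(.x, na.rm = TRUE),
                  max  = ~max(.x, na.rm = TRUE)),
             .names = paste0(prefix, "_{.col}_{.fn}")),
      .groups = "drop"
    )
}

# Finest level first, then join each coarser summary onto the result,
# using the grouping columns the two tables have in common.
full <- rev(seq_along(levels_outer_to_inner)) %>%
  map(~sum_level(df, .x)) %>%
  reduce(function(fine, coarse) {
    left_join(fine, coarse, by = intersect(names(fine), names(coarse)))
  })
```

Because every summary column carries its level's prefix (teacher_x1_mean, sch_x1_mean, div_x1_mean, ...), the only columns two adjacent tables share are the grouping columns, which is why intersect() can serve as the join key here. If your real data has other columns with identical names across levels, you would need to spell out the keys instead.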

Upvotes: 2
