Ryan
Ryan

Reputation: 1068

How to summarizing nested groups in R

In a data frame like data below:

library(tidyverse)
ID <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y","Z", "a","b","c","d")
State <- rep(c("FL", "GA", "SC", "NC", "VA", "GA"), each = 5)
Location <- rep(c("alpha", "beta", "gamma"), each = 10)
Var3 <- rep(c("Bravo", "Charlie", "Delta", "Echo"), times = c(7,8,10,5))
Sex <- rep(c("M","F","M"), times = 10)
data <- data.frame(ID, State, Location, Var3, Sex)

I want to return a data frame, or a list of several data frames, that summarize each way the data can be grouped. I want to see how many individual IDs are in each State, Location, and Var3, how many M and F are in each State, Location, and Var3, how many Locations are in each State, ect... what is the best way to achieve this.

Upvotes: 2

Views: 666

Answers (2)

akrun
akrun

Reputation: 887128

We can use count

library(dplyr)
data %>%
  count(State, Location, Var3, Sex)

Also, to get rollup/cube way of hierarchial counts,

library(data.table)
rollup(as.data.table(data), j = .N, by = c("State","Location","Var3", "Sex"))
#    State Location    Var3  Sex  N
# 1:    FL    alpha   Bravo    M  3
# 2:    FL    alpha   Bravo    F  2
# 3:    GA    alpha   Bravo    M  2
# 4:    GA    alpha Charlie    F  1
# 5:    GA    alpha Charlie    M  2
# 6:    SC     beta Charlie    F  2
# 7:    SC     beta Charlie    M  3
# 8:    NC     beta   Delta    M  3
# 9:    NC     beta   Delta    F  2
#10:    VA    gamma   Delta    M  4
#11:    VA    gamma   Delta    F  1
#12:    GA    gamma    Echo    F  2
#13:    GA    gamma    Echo    M  3
#14:    FL    alpha   Bravo <NA>  5
#15:    GA    alpha   Bravo <NA>  2
#16:    GA    alpha Charlie <NA>  3
#17:    SC     beta Charlie <NA>  5
#18:    NC     beta   Delta <NA>  5
#19:    VA    gamma   Delta <NA>  5
#20:    GA    gamma    Echo <NA>  5
#21:    FL    alpha    <NA> <NA>  5
#22:    GA    alpha    <NA> <NA>  5
#23:    SC     beta    <NA> <NA>  5
#24:    NC     beta    <NA> <NA>  5
#25:    VA    gamma    <NA> <NA>  5
#26:    GA    gamma    <NA> <NA>  5
#27:    FL     <NA>    <NA> <NA>  5
#28:    GA     <NA>    <NA> <NA> 10
#29:    SC     <NA>    <NA> <NA>  5
#30:    NC     <NA>    <NA> <NA>  5
#31:    VA     <NA>    <NA> <NA>  5
#32:  <NA>     <NA>    <NA> <NA> 30
#    State Location    Var3  Sex  N

Or use cube

cube(as.data.table(data), j = .N, by = c("State","Location","Var3", "Sex"))
#.   State Location    Var3  Sex  N
#  1:    FL    alpha   Bravo    M  3
#  2:    FL    alpha   Bravo    F  2
#  3:    GA    alpha   Bravo    M  2
#  4:    GA    alpha Charlie    F  1
#  5:    GA    alpha Charlie    M  2
# ---                               
#111:  <NA>     <NA>   Delta <NA> 10
#112:  <NA>     <NA>    Echo <NA>  5
#113:  <NA>     <NA>    <NA>    M 20
#114:  <NA>     <NA>    <NA>    F 10
#115:  <NA>     <NA>    <NA> <NA> 30

Upvotes: 2

tmfmnk
tmfmnk

Reputation: 39858

One dplyr and purrr solution to group by all possible combinations of column names could be:

map2(list(colnames(data)), 
     1:ncol(data),
     combn, simplify = FALSE) %>%
 flatten() %>%
 map(~ data %>%
      group_by_at(.x) %>%
      tally())

In this case, there are 31 possible combinations of column names, so it returns 31 lists. The first three lists:

[[1]]
# A tibble: 30 x 2
   ID        n
   <fct> <int>
 1 a         1
 2 A         1
 3 b         1
 4 B         1
 5 c         1
 6 C         1
 7 d         1
 8 D         1
 9 E         1
10 F         1
# … with 20 more rows

[[2]]
# A tibble: 5 x 2
  State     n
  <fct> <int>
1 FL        5
2 GA       10
3 NC        5
4 SC        5
5 VA        5

[[3]]
# A tibble: 3 x 2
  Location     n
  <fct>    <int>
1 alpha       10
2 beta        10
3 gamma       10

Upvotes: 1

Related Questions