iboboboru
iboboboru

Reputation: 1102

Collapse rows of dplyr tibble based on cumulative frequency

#Generate some data
set.seed(1234)
rows = 100
created_data <- data.frame(index = 1:rows,
                           catsA = sample((letters[1:5]),rows,replace=T),
                           valueA = round(rnorm(rows),3))

Using dplyr create a count of categories and order by the count.

library(dplyr)

count_of_cat <- created_data %>% 
  group_by(catsA) %>%
  summarise(rowcount = n()) %>%
  ungroup %>%
  arrange(-rowcount) %>%
  mutate(rel.freq = round(rowcount/sum(rowcount),3)) %>%
  mutate(cum.freq = cumsum(rel.freq))

Output

 catsA rowcount rel.freq cum.freq
1     b       26     0.26     0.26
2     a       25     0.25     0.51
3     c       17     0.17     0.68
4     d       17     0.17     0.85
5     e       15     0.15     1.00

Is there a good way to rollup rows after say cum.freq > 0.50

Desired output

 catsA rowcount rel.freq cum.freq
1     b       26     0.26     0.26
2     a       25     0.25     0.51
3     new     49     0.49     1.00

Upvotes: 0

Views: 349

Answers (1)

iboboboru
iboboboru

Reputation: 1102

Worked it out from here dplyr mutate rowSums calculations or custom functions

count_of_cat %>% filter(cum.freq <= 0.51) %>%
  rbind(
    count_of_cat %>% filter(cum.freq > 0.51) %>%
  summarise(catsA = "new", 
            rowcount = sum(rowcount),
            rel.freq = sum(rel.freq),
            cum.freq = 1.00)
  )

Upvotes: 1

Related Questions