Laura
Laura

Reputation: 513

Using dplyr to create new groups inside a column

This is my dataframe:

mydf<-structure(list(DS_FAIXA_ETARIA = c("Inválido", "16 anos", "17 anos", 
"18 anos", "19 anos", "20 anos", "21 a 24 anos", "25 a 29 anos", 
"30 a 34 anos", "35 a 39 anos"), n = c(5202L, 48253L, 67401L, 
79398L, 88233L, 90738L, 149634L, 198848L, 238406L, 265509L)), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))

I would like to have grouped the observations into one group called: 16 a 20 anos.

"16 anos", "17 anos", 
"18 anos", "19 anos", "20 anos"

In other words I would like to "merge" the rows 2-6 and sum its observations on the n column. I would have one row represent the sum of rows 2-6.

Is it possible to do this using group_by and then summarise(sum(DS_FAIXA_ETARIA)) verbs from dplyr?

This would be the output that I want:

mydf<-structure(list(DS_FAIXA_ETARIA = c("Inválido","16 a 20 anos" ,"21 a 24 anos", "25 a 29 anos", 
                                   "30 a 34 anos", "35 a 39 anos"), n = c(5202L,374023L , 149634L, 198848L, 238406L, 265509L)), row.names = c(NA, 
                                                                                                                                                      -6L), class = c("tbl_df", "tbl", "data.frame"))

Many thanks

Upvotes: 1

Views: 169

Answers (2)

TarJae
TarJae

Reputation: 78927

This should the job. First sum with summarize. Then add_row to the original dataframe. slice_tail and arrange

df1 <- mydf %>% 
  summarise(`16 a 20 anos`= sum(n[2:6]))

mydf %>% 
  add_row(DS_FAIXA_ETARIA=names(df1), n=df1$`16 a 20 anos`[1]) %>% 
  slice_tail(n=5) %>% 
  arrange(DS_FAIXA_ETARIA)

Output:

  DS_FAIXA_ETARIA      n
  <chr>            <int>
1 16 a 20 anos    374023
2 21 a 24 anos    149634
3 25 a 29 anos    198848
4 30 a 34 anos    238406
5 35 a 39 anos    265509

Upvotes: 2

akrun
akrun

Reputation: 887118

We create a grouping variable based on the occurrence of 'Invalido' or those elements with only digits (\\d+) followed by space and 'anos', then summarise by pasteing the first and last elements while getting the sum of 'n'

library(dplyr)
library(stringr)
mydf %>% 
  group_by(grp = replace(cumsum(!str_detect(DS_FAIXA_ETARIA,
     '^\\d+\\s+anos$')), DS_FAIXA_ETARIA == 'Inválido', 0))  %>% 
  summarise(DS_FAIXA_ETARIA = if(n() > 1) 
      str_c(DS_FAIXA_ETARIA[c(1, n())], collapse="_") else 
    DS_FAIXA_ETARIA, n = sum(n), .groups = 'drop') %>%
  select(-grp)

-output

# A tibble: 6 x 2
#  DS_FAIXA_ETARIA      n
#  <chr>            <int>
#1 Inválido          5202
#2 16 anos_20 anos 374023
#3 21 a 24 anos    149634
#4 25 a 29 anos    198848
#5 30 a 34 anos    238406
#6 35 a 39 anos    265509

Upvotes: 1

Related Questions