Reputation: 43

How can I use the dplyr package in r to group a date variable in a dataframe by month and summarise numeric and categorical variables?

I want to group by month and summarise all variables (numeric and categorical variable for each month). Particularly, the categorical variable in the resulting tibble should show the most frequent level and its frequency (in percentage) for each month. Assume an example data set below:

date <- as.Date(c("2021-03-13", 
                  "2021-03-12", 
                  "2021-04-14", 
                  "2021-04-17", 
                  "2021-04-17", 
                  "2021-05-17", "2021-05-17", "2021-06-17", "2021-07-17", "2021-07-17"))
Partograph_use <- as.factor(c("Partograph", "Partograph","Partograph",
                              "Partograph","Partograph", "Partograph","labor care guide" ,
                              "labor care guide", "labor care guide" , "labor care guide"))
duration_labor <- as.numeric(c(12, 5, 6, 5, 5, 6, 7, 10, 10, 5))
augument_ox <- as.factor(c("Yes", "Yes", "No", "No", "No", "No", "Yes", "No", "Yes", "No"))
urgent_csection <- as.factor(c("Yes", "Yes", "No", "No", "No", "No", "Yes", "No", "Yes", "No"))                    Yes

Then the example data frame:

test.df <- cbind.data.frame(date, Partograph_use, duration_labor, augument_ox, urgent_csection)

I have tried using lubridate function and dplyr package to group the date by month, but was only successful for the numeric variable "duration_of_labour_hours" using this code:

lcg.df %>% group_by(month = lubridate::floor_date(date_of_admission,'month')) %>% 
summarize(median_dur_labor = median(duration_of_labour_hours, na.rm = TRUE), 
quat_5th = quantile(duration_of_labour_hours, probs = 0.05, na.rm = TRUE), 
quat_95th = quantile(duration_of_labour_hours, probs = 0.95, na.rm = TRUE))

Unfortunately, I failed to figure out how to summarise the categorical variables, such that the tibble displays the levels with the highest frequency and its respective frequency (in percentage) for each month row.

Upvotes: 3

Answers (2)

Dave2e

Reputation: 24109

Bit of a brute approach but you will get a table which you can then work with:

lcg.df %>% group_by(month = lubridate::floor_date(date_of_admission,'month')) %>% 
   reframe(median_dur_labor = median(duration_of_labour_hours, na.rm = TRUE), 
             quat_5th = quantile(duration_of_labour_hours, probs = 0.05, na.rm = TRUE), 
             quat_95th = quantile(duration_of_labour_hours, probs = 0.95, na.rm = TRUE),
             count=n(), artifical_yes = sum(artificial_rupture_of_memb=="Yes"),
           artifical_no = sum(artificial_rupture_of_memb=="No"),
           augmentation_yes = sum(augmentation_with_oxytocin=="Yes"),
           augmentation_no = sum(augmentation_with_oxytocin=="No")
           )


  month      median_dur_labor quat_5th quat_95th count artifical_yes artifical_no augmentation_yes augmentation_no
  <date>                <dbl>    <dbl>     <dbl> <int>         <int>        <int>            <int>           <int>
1 2021-07-01              7       3.4       11.6    49            39           10               16              33
2 2021-08-01              7       5         10      21            17            4               12               9
3 2021-09-01              8       4         14      21            19            2               13               8
4 2021-10-01              5.5     3.35      10.6     8             6            2                3               5
5 2023-08-01             18      18         18       1             1            0                1               0

Upvotes: 2

r2evans

Reputation: 160647

Try this.

quux %>%
  mutate(ym = lubridate::floor_date(date_of_admission, unit="months")) %>%
  group_by(ym) %>%
  reframe(
    across(where(is.numeric), list(mean = ~ mean(.), median = ~ median(.))),
    across(where(is.character), ~ { tb <- table(.); 100 * sort(tb)[1]/sum(tb); }, .names = "{.col}_pct"),
    across(where(is.character), ~ names(sort(table(.)))[1], .names = "{.col}_mode")
  )
# # A tibble: 5 × 9
#   ym         duration_of_labour_hours_mean duration_of_labour_hours_median labour_monitoring_pct artificial_rupture_of_memb_pct augmentation_with_oxytocin_pct labour_monitoring_mode artificial_rupture_of_memb_mode augmentation_with_ox…¹
#   <date>                             <dbl>                           <dbl>                 <dbl>                          <dbl>                          <dbl> <chr>                  <chr>                           <chr>                 
# 1 2021-07-01                          7.24                             7                     100                          20.4                            32.7 Partograph             No                              Yes                   
# 2 2021-08-01                          7.19                             7                     100                          19.0                            42.9 Partograph             No                              No                    
# 3 2021-09-01                          8.19                             8                     100                           9.52                           38.1 Partograph             No                              No                    
# 4 2021-10-01                          6.12                             5.5                   100                          25                              37.5 Partograph             No                              Yes                   
# 5 2023-08-01                         18                               18                     100                         100                             100   Partograph             Yes                             Yes                   
# # ℹ abbreviated name: ¹augmentation_with_oxytocin_mode

It's a bit unfortunate here that I'm calculating the table(.) of character columns twice; if your data is larger, it's not too difficult to come up with a function that does it once and returns (say) a list-column or similar, and then you can use another across(.) "accessor" to extract the two components. This should only be necessary if your data is rather large and the double-table is too costly.

The calcs on the character columns need to come last, since otherwise their "percent" columns will be discovered in the numeric summaries and double-summarized (just unnecessary).

The use of a named-function of across is what really helps the most here: list(mu = ~ mean(.)) is a simple example, you can use whatever names you want (LHS of =), and the RHS can be either a ~-function as I've shown here, it can be a more formal function(z) {...}, or it can be a named function (e.g., list(mu = mean) also works, though sometimes discouraged for reasons I do not know). This allows you to arbitrarily add others (e.g., list(..., q90 = ~ quantile(., 0.9))).

I used where(is.character), but if you have factors as well then you likely want them considered as well. For that, you can use where(~ is.character(.) | is.factor(.)) instead.

Data

quux <- structure(list(date_of_admission = structure(c(18821, 18820, 18822, 18824, 18825, 18824, 18825, 18824, 18826, 18825, 18824, 18826, 18825, 18820, 18821, 18821, 18825, 18821, 18821, 18812, 18820, 18821, 18811, 18809, 18809, 18817, 18818, 18824, 18823, 18813, 18813, 18812, 18812, 18815, 18815, 18814, 18810, 18810, 18811, 18832, 18829, 18823, 19591, 18902, 18923, 18830, 18869, 18811, 18810, 18861, 18922, 18867, 18884, 18885, 18852, 18852, 18879, 18816, 18848, 18850, 18879, 18857, 18877, 18878, 18816,  18822, 18829, 18864, 18854, 18864, 18854, 18879, 18879, 18863, 18871, 18855, 18878, 18878, 18874, 18850, 18855, 18871, 18878, 18881, 18874, 18874, 18874, 18882, 18873, 18930, 18930, 18930, 18842, 18862, 18929, 18873, 18929, 18842, 18842, 18842), class = "Date"), labour_monitoring = c("Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph",  "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph",  "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph",  "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph"), duration_of_labour_hours = c(12L, 5L, 5L, 8L, 6L, 7L, 10L, 10L, 5L, 7L, 7L, 8L, 8L, 6L, 6L, 7L, 8L, 9L, 8L, 11L, 12L, 7L, 10L, 6L, 5L, 12L, 9L, 10L, 6L, 5L, 6L, 6L, 5L, 5L, 3L, 3L, 7L, 8L, 3L, 5L, 11L, 6L, 18L, 6L, 4L, 4L, 8L, 11L, 5L, 10L, 8L, 7L, 8L, 10L, 7L, 8L, 8L, 10L, 7L, 9L, 4L, 5L, 10L, 5L, 8L, 8L, 6L, 8L, 8L, 13L, 6L, 4L,  14L, 5L, 4L, 6L, 12L, 7L, 15L, 4L, 7L, 9L, 6L, 6L, 7L, 12L, 5L, 10L, 11L, 4L, 3L, 5L, 9L, 6L, 7L, 5L, 12L, 5L, 6L, 7L), artificial_rupture_of_memb = c("Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",  "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes"), augmentation_with_oxytocin = c("No", "No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No",  "No", "No", "No", "No", "No", "Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "Yes", "No", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "No", "No", "Yes", "No", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "No", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "No", "No", "No", "Yes", "No", "Yes", "No", "No", "Yes", "Yes", "No", "No", "Yes", "No", "Yes" )), row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61", "62", "63", "64", "65", "66", "67", "68", "69", "70", "71", "72", "73", "74", "75", "76", "77", "78", "79", "80", "81", "82",  "83", "84", "85", "86", "87", "88", "89", "90", "91", "92", "93", "94", "95", "96", "97", "98", "99", "100"), class = "data.frame")

Upvotes: 5

How can I use the dplyr package in r to group a date variable in a dataframe by month and summarise numeric and categorical variables?

Answers (2)

Related Questions