Reputation: 43
I want to group by month and summarise all variables (numeric and categorical variable for each month). Particularly, the categorical variable in the resulting tibble should show the most frequent level and its frequency (in percentage) for each month. Assume an example data set below:
date <- as.Date(c("2021-03-13",
"2021-03-12",
"2021-04-14",
"2021-04-17",
"2021-04-17",
"2021-05-17", "2021-05-17", "2021-06-17", "2021-07-17", "2021-07-17"))
Partograph_use <- as.factor(c("Partograph", "Partograph","Partograph",
"Partograph","Partograph", "Partograph","labor care guide" ,
"labor care guide", "labor care guide" , "labor care guide"))
duration_labor <- as.numeric(c(12, 5, 6, 5, 5, 6, 7, 10, 10, 5))
augument_ox <- as.factor(c("Yes", "Yes", "No", "No", "No", "No", "Yes", "No", "Yes", "No"))
urgent_csection <- as.factor(c("Yes", "Yes", "No", "No", "No", "No", "Yes", "No", "Yes", "No")) Yes
Then the example data frame:
test.df <- cbind.data.frame(date, Partograph_use, duration_labor, augument_ox, urgent_csection)
I have tried using lubridate function and dplyr package to group the date by month, but was only successful for the numeric variable "duration_of_labour_hours" using this code:
lcg.df %>% group_by(month = lubridate::floor_date(date_of_admission,'month')) %>%
summarize(median_dur_labor = median(duration_of_labour_hours, na.rm = TRUE),
quat_5th = quantile(duration_of_labour_hours, probs = 0.05, na.rm = TRUE),
quat_95th = quantile(duration_of_labour_hours, probs = 0.95, na.rm = TRUE))
Unfortunately, I failed to figure out how to summarise the categorical variables, such that the tibble displays the levels with the highest frequency and its respective frequency (in percentage) for each month row.
Upvotes: 3
Views: 85
Reputation: 24109
Bit of a brute approach but you will get a table which you can then work with:
lcg.df %>% group_by(month = lubridate::floor_date(date_of_admission,'month')) %>%
reframe(median_dur_labor = median(duration_of_labour_hours, na.rm = TRUE),
quat_5th = quantile(duration_of_labour_hours, probs = 0.05, na.rm = TRUE),
quat_95th = quantile(duration_of_labour_hours, probs = 0.95, na.rm = TRUE),
count=n(), artifical_yes = sum(artificial_rupture_of_memb=="Yes"),
artifical_no = sum(artificial_rupture_of_memb=="No"),
augmentation_yes = sum(augmentation_with_oxytocin=="Yes"),
augmentation_no = sum(augmentation_with_oxytocin=="No")
)
month median_dur_labor quat_5th quat_95th count artifical_yes artifical_no augmentation_yes augmentation_no
<date> <dbl> <dbl> <dbl> <int> <int> <int> <int> <int>
1 2021-07-01 7 3.4 11.6 49 39 10 16 33
2 2021-08-01 7 5 10 21 17 4 12 9
3 2021-09-01 8 4 14 21 19 2 13 8
4 2021-10-01 5.5 3.35 10.6 8 6 2 3 5
5 2023-08-01 18 18 18 1 1 0 1 0
Upvotes: 2
Reputation: 160647
Try this.
quux %>%
mutate(ym = lubridate::floor_date(date_of_admission, unit="months")) %>%
group_by(ym) %>%
reframe(
across(where(is.numeric), list(mean = ~ mean(.), median = ~ median(.))),
across(where(is.character), ~ { tb <- table(.); 100 * sort(tb)[1]/sum(tb); }, .names = "{.col}_pct"),
across(where(is.character), ~ names(sort(table(.)))[1], .names = "{.col}_mode")
)
# # A tibble: 5 × 9
# ym duration_of_labour_hours_mean duration_of_labour_hours_median labour_monitoring_pct artificial_rupture_of_memb_pct augmentation_with_oxytocin_pct labour_monitoring_mode artificial_rupture_of_memb_mode augmentation_with_ox…¹
# <date> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
# 1 2021-07-01 7.24 7 100 20.4 32.7 Partograph No Yes
# 2 2021-08-01 7.19 7 100 19.0 42.9 Partograph No No
# 3 2021-09-01 8.19 8 100 9.52 38.1 Partograph No No
# 4 2021-10-01 6.12 5.5 100 25 37.5 Partograph No Yes
# 5 2023-08-01 18 18 100 100 100 Partograph Yes Yes
# # ℹ abbreviated name: ¹augmentation_with_oxytocin_mode
It's a bit unfortunate here that I'm calculating the table(.)
of character columns twice; if your data is larger, it's not too difficult to come up with a function that does it once and returns (say) a list-column or similar, and then you can use another across(.)
"accessor" to extract the two components. This should only be necessary if your data is rather large and the double-table
is too costly.
The calcs on the character columns need to come last, since otherwise their "percent" columns will be discovered in the numeric summaries and double-summarized (just unnecessary).
The use of a named-function of across
is what really helps the most here: list(mu = ~ mean(.))
is a simple example, you can use whatever names you want (LHS of =
), and the RHS can be either a ~
-function as I've shown here, it can be a more formal function(z) {...}
, or it can be a named function (e.g., list(mu = mean)
also works, though sometimes discouraged for reasons I do not know). This allows you to arbitrarily add others (e.g., list(..., q90 = ~ quantile(., 0.9))
).
I used where(is.character)
, but if you have factor
s as well then you likely want them considered as well. For that, you can use where(~ is.character(.) | is.factor(.))
instead.
Data
quux <- structure(list(date_of_admission = structure(c(18821, 18820, 18822, 18824, 18825, 18824, 18825, 18824, 18826, 18825, 18824, 18826, 18825, 18820, 18821, 18821, 18825, 18821, 18821, 18812, 18820, 18821, 18811, 18809, 18809, 18817, 18818, 18824, 18823, 18813, 18813, 18812, 18812, 18815, 18815, 18814, 18810, 18810, 18811, 18832, 18829, 18823, 19591, 18902, 18923, 18830, 18869, 18811, 18810, 18861, 18922, 18867, 18884, 18885, 18852, 18852, 18879, 18816, 18848, 18850, 18879, 18857, 18877, 18878, 18816, 18822, 18829, 18864, 18854, 18864, 18854, 18879, 18879, 18863, 18871, 18855, 18878, 18878, 18874, 18850, 18855, 18871, 18878, 18881, 18874, 18874, 18874, 18882, 18873, 18930, 18930, 18930, 18842, 18862, 18929, 18873, 18929, 18842, 18842, 18842), class = "Date"), labour_monitoring = c("Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph", "Partograph"), duration_of_labour_hours = c(12L, 5L, 5L, 8L, 6L, 7L, 10L, 10L, 5L, 7L, 7L, 8L, 8L, 6L, 6L, 7L, 8L, 9L, 8L, 11L, 12L, 7L, 10L, 6L, 5L, 12L, 9L, 10L, 6L, 5L, 6L, 6L, 5L, 5L, 3L, 3L, 7L, 8L, 3L, 5L, 11L, 6L, 18L, 6L, 4L, 4L, 8L, 11L, 5L, 10L, 8L, 7L, 8L, 10L, 7L, 8L, 8L, 10L, 7L, 9L, 4L, 5L, 10L, 5L, 8L, 8L, 6L, 8L, 8L, 13L, 6L, 4L, 14L, 5L, 4L, 6L, 12L, 7L, 15L, 4L, 7L, 9L, 6L, 6L, 7L, 12L, 5L, 10L, 11L, 4L, 3L, 5L, 9L, 6L, 7L, 5L, 12L, 5L, 6L, 7L), artificial_rupture_of_memb = c("Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes"), augmentation_with_oxytocin = c("No", "No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "Yes", "No", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "No", "No", "Yes", "No", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "No", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "No", "No", "No", "Yes", "No", "Yes", "No", "No", "Yes", "Yes", "No", "No", "Yes", "No", "Yes" )), row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61", "62", "63", "64", "65", "66", "67", "68", "69", "70", "71", "72", "73", "74", "75", "76", "77", "78", "79", "80", "81", "82", "83", "84", "85", "86", "87", "88", "89", "90", "91", "92", "93", "94", "95", "96", "97", "98", "99", "100"), class = "data.frame")
Upvotes: 5