Raesu
Raesu

Reputation: 310

dplyr use n() within summarize to get row count of different grouping

My data frame includes region, date (by day), and general response data for a survey. I have grouped and summarized by region and date, so that I have a daily count of completed surveys (looking at a boolean Finished variable). If I use n(), I get a total for a region on that date (since I grouped the data frame by both of these).

I would like to add the total based on only region, across all dates. Is this possible using the dplyr methods? I have currently calculated the region totals, and after summarizing my table, have manually added a column using case_when to add the region totals. The goal is to make cumulative completion percentages per region in ggplot.

Is my approach the only way, or is it possible to use dplyr methods to remove a grouping level when calculating n()?

Example below. DailyCompeted is the sum of boolean Finished, and total is giving the total count for that day in that region. My goal is for total to be the sum of that column (92), per region in every row.

x %>% group_by(Region, EndDate) %>% summarize(DailyCompleted=sum(Finished), total=n())

Region EndDate    DailyCompleted total
   <fct>  <date>              <int> <int>
 1 AMER   2019-04-29              0     1
 2 AMER   2019-04-30              0     1
 3 AMER   2019-05-02              1     1
 4 AMER   2019-05-03              1     1
 5 AMER   2019-05-06             40    54
 6 AMER   2019-05-07             12    17
 7 AMER   2019-05-08              4     7
 8 AMER   2019-05-09              3     5
 9 AMER   2019-05-10              3     4
10 AMER   2019-05-12              1     1

Upvotes: 0

Views: 2797

Answers (1)

Len Greski
Len Greski

Reputation: 10845

We can reaggregate the summarized data by Region, and then join it bak to the data frame that has been aggregated by EndDate.

textFile <- "Region EndDate    DailyCompleted total
 AMER   2019-04-29              0     1
 AMER   2019-04-30              0     1
 AMER   2019-05-02              1     1
 AMER   2019-05-03              1     1
 AMER   2019-05-06             40    54
 AMER   2019-05-07             12    17
 AMER   2019-05-08              4     7
 AMER   2019-05-09              3     5
 AMER   2019-05-10              3     4
AMER   2019-05-12              1     1"

regionDay <- read.table(text = textFile,header = TRUE)

library(dplyr)
regionDay %>% group_by(Region) %>%
     summarise(regionDay,region_total = sum(total)) %>%
     inner_join(regionDay,.)
region_tot

...and the output is:

   Region    EndDate DailyCompleted total region_total
1    AMER 2019-04-29              0     1           92
2    AMER 2019-04-30              0     1           92
3    AMER 2019-05-02              1     1           92
4    AMER 2019-05-03              1     1           92
5    AMER 2019-05-06             40    54           92
6    AMER 2019-05-07             12    17           92
7    AMER 2019-05-08              4     7           92
8    AMER 2019-05-09              3     5           92
9    AMER 2019-05-10              3     4           92
10   AMER 2019-05-12              1     1           92
> 

Upvotes: 1

Related Questions