Group by and summarize with duplicates removed

Question

We can use the following data frame as an example:

Cases <- c("Siddhartha", "Siddhartha", "Siddhartha", "Paul", "Paul", "Paul", "Hannah")
Procedures <- c("1", "1", "2", "3", "3", "4", "1")

(df <- data.frame(Cases, Procedures))

       Cases Procedures
1 Siddhartha          1
2 Siddhartha          1
3 Siddhartha          2
4       Paul          3
5       Paul          3
6       Paul          4
7     Hannah          1

Now I do the following:

library(dplyr)  # provides %>%, n(), and re-exports enquo()

Sum_Group <- function(df, variable){
  variable <- enquo(variable)

  df %>%
    dplyr::group_by(!! variable) %>%
    dplyr::summarize(Number = n()) %>%
    dplyr::mutate(Prozent = round((Number/sum(Number)*100)))
}

Sum_Group(df, Procedures)

which gives me:

# A tibble: 4 x 3
  Procedures Number Prozent
            
1 1               3      43
2 2               1      14
3 3               2      29
4 4               1      14

This is not exactly what I want, though. What I want is the following data frame:

  Procedures Number Prozent
            
1 1               2      40
2 2               1      20
3 3               1      20
4 4               1      20

Notice the difference for procedures 1 and 3.

So I would like a function that counts multiple occurrences of the same procedure within one case as 1, rather than counting every occurrence as in the first example. The function should also work on varying data frames with different (unknown) cases and procedures.

I am not sure whether this is easily done and I'm just overlooking something.

Regards

Ronak Shah · Accepted Answer

You want to count the number of distinct Cases for each value of Procedures; n_distinct() does exactly that. You can also use the curly-curly operator ({{ }}), which does the job of both enquo() and !! in one step.
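
To see why n_distinct() gives the desired count, it may help to look at procedure 1 in isolation. This is only a small sketch: the vector x is a hypothetical stand-in for the Cases values that fall into that group.

```r
library(dplyr)

# Cases recorded for procedure "1" in the example data (illustrative stand-in)
x <- c("Siddhartha", "Siddhartha", "Hannah")

rows  <- length(x)      # what n() counts inside the group: every row
cases <- n_distinct(x)  # unique cases only; the repeated "Siddhartha" counts once

rows   # 3
cases  # 2
```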

library(dplyr)
library(rlang)

Sum_Group <- function(df, variable) {

  df %>%
    group_by({{variable}}) %>%
    summarise(Number = n_distinct(Cases)) %>%
    mutate(Prozent = round((Number/sum(Number)*100)))
}

Sum_Group(df, Procedures)

# A tibble: 4 x 3
#  Procedures Number Prozent
#            
#1 1               2      40
#2 2               1      20
#3 3               1      20
#4 4               1      20
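
The function above hard-codes the Cases column. Since the question mentions data frames with different (unknown) cases and procedures, one possible variation is to pass the case column as an argument too. This is a sketch, not part of the original answer: the name Sum_Group2 and the argument case are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical variant: both the grouping column and the case column are arguments
Sum_Group2 <- function(df, variable, case) {
  df %>%
    group_by({{ variable }}) %>%
    # count each case at most once per group
    summarise(Number = n_distinct({{ case }})) %>%
    # share of each group, in whole percent
    mutate(Prozent = round(Number / sum(Number) * 100))
}

# Example data from the question
df <- data.frame(
  Cases = c("Siddhartha", "Siddhartha", "Siddhartha", "Paul", "Paul", "Paul", "Hannah"),
  Procedures = c("1", "1", "2", "3", "3", "4", "1")
)

res <- Sum_Group2(df, Procedures, Cases)
res
```

On the example data this should give the same Number and Prozent columns as shown above (2, 1, 1, 1 and 40, 20, 20, 20).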
