Is there a way to write this in a single Dplyr statement / more efficiently?

Question

My (simplified) dataset consists of donor occupation and contribution amounts. I'm trying to determine what the average contribution amount by occupation is (note: donor occupations are often repeated in the column, so I use that as a grouping variable). Right now, I'm using two dplyr statements -- one to get a sum of contributions amount by each occupation and another to get a count of the number of donations from that specific occupation. I am then binding the dataframes with cbind and creating a new column with mutate, where I can divide the sum by the count.

Data example:

contributor_occupation contribution_receipt_amount
1                    LISTING COORDINATOR                        5.00
2                           NOT EMPLOYED                        2.70
3                                TEACHER                        2.70
4                    ELECTRICAL DESIGNER                        2.00
5                                STUDENT                       50.00
6                      SOFTWARE ENGINEER                       10.00
7                           TRUCK DRIVER                        2.70
8                           NOT EMPLOYED                       50.00
9                             CONTRACTOR                        5.00
10                              ENGINEER                        6.00
11                                FARMER                        2.70
12                                ARTIST                       50.00
13                         CIRCUS ARTIST                      100.00
14                         CIRCUS ARTIST                       27.00
15          INFORMATION SECURITY ANALYST                        2.00
16                                LAWYER                        5.00

occupation2 <- b %>%
  select(contributor_occupation, contribution_receipt_amount) %>%
  group_by(contributor_occupation) %>%
  summarise(total = sum(contribution_receipt_amount)) %>%
  arrange(desc(contributor_occupation))


occupation3 <- b %>%
  select(contributor_occupation) %>%
  count(contributor_occupation) %>%
  group_by(contributor_occupation) %>%
  arrange(desc(contributor_occupation))

final_occ <- cbind(occupation2, occupation3[, 2]) # remove duplicate column 

occ_avg <- final_occ %>% 
  select(contributor_occupation:n) %>%
  mutate("Average Donation" = total/n) %>%
  rename("Number of Donations"= n, "Occupation" = contributor_occupation, "Total Donated" = total)

occ_avg %>%
  arrange(desc(`Average Donation`))

This gives me the result I want but seems like a very cumbersome process. It seems I get the same result by using the following code; however, I am confused as to why it works:

avg_donation_occupation <- b %>%
  select(contributor_occupation, contribution_receipt_amount) %>%
  group_by(contributor_occupation) %>%
  summarize(avg_donation_by_occupation = sum(contribution_receipt_amount)/n()) %>%
  arrange(desc(avg_donation_by_occupation))

Wouldn't dividing by n divide by the number of rows (i.e., number of occupations) as opposed to the number of people in that occupation (which is what I used the count function for previously)?

Thanks for the help clearing up any confusion!

akrun · Accepted Answer

We may need both sum and mean along with n() which gives the number of observations in the grouped data. According to ?context

n() gives the current group size.

and `?mean

mean - Generic function for the (trimmed) arithmetic mean.

which is basically the sum of observations divided by the number of observations

library(dplyr)
out <- b %>% 
  group_by(Occupation = contributor_occupation) %>% 
  summarise(`Total Donated` = sum(contribution_receipt_amount), 
      `Number of Donations` = n(),
   `Average Donation` = mean(contribution_receipt_amount),
   #or
   #`Average Donation` = `Total Donated`/`Number of Donations`,    
      .groups = 'drop') %>% 
   
  arrange(desc(`Average Donation`))

-output

out
# A tibble: 14 × 4
   Occupation                   `Total Donated` `Number of Donations` `Average Donation`
                                                                    
 1 CIRCUS ARTIST                          127                       2               63.5
 2 ARTIST                                  50                       1               50  
 3 STUDENT                                 50                       1               50  
 4 NOT EMPLOYED                            52.7                     2               26.4
 5 SOFTWARE ENGINEER                       10                       1               10  
 6 ENGINEER                                 6                       1                6  
 7 CONTRACTOR                               5                       1                5  
 8 LAWYER                                   5                       1                5  
 9 LISTING COORDINATOR                      5                       1                5  
10 FARMER                                   2.7                     1                2.7
11 TEACHER                                  2.7                     1                2.7
12 TRUCK DRIVER                             2.7                     1                2.7
13 ELECTRICAL DESIGNER                      2                       1                2  
14 INFORMATION SECURITY ANALYST             2                       1                2

data

b <- structure(list(contributor_occupation = c("LISTING COORDINATOR", 
"NOT EMPLOYED", "TEACHER", "ELECTRICAL DESIGNER", "STUDENT", 
"SOFTWARE ENGINEER", "TRUCK DRIVER", "NOT EMPLOYED", "CONTRACTOR", 
"ENGINEER", "FARMER", "ARTIST", "CIRCUS ARTIST", "CIRCUS ARTIST", 
"INFORMATION SECURITY ANALYST", "LAWYER"), contribution_receipt_amount = c(5, 
2.7, 2.7, 2, 50, 10, 2.7, 50, 5, 6, 2.7, 50, 100, 27, 2, 5)), 
class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15", "16"))

Is there a way to write this in a single Dplyr statement / more efficiently?

Answers (1)

data

Related Questions