Reputation: 61
My (simplified) dataset consists of donor occupation and contribution amounts. I'm trying to determine what the average contribution amount by occupation is (note: donor occupations are often repeated in the column, so I use that as a grouping variable). Right now, I'm using two dplyr statements -- one to get a sum of contributions amount by each occupation and another to get a count of the number of donations from that specific occupation. I am then binding the dataframes with cbind and creating a new column with mutate, where I can divide the sum by the count.
Data example:
contributor_occupation contribution_receipt_amount
1 LISTING COORDINATOR 5.00
2 NOT EMPLOYED 2.70
3 TEACHER 2.70
4 ELECTRICAL DESIGNER 2.00
5 STUDENT 50.00
6 SOFTWARE ENGINEER 10.00
7 TRUCK DRIVER 2.70
8 NOT EMPLOYED 50.00
9 CONTRACTOR 5.00
10 ENGINEER 6.00
11 FARMER 2.70
12 ARTIST 50.00
13 CIRCUS ARTIST 100.00
14 CIRCUS ARTIST 27.00
15 INFORMATION SECURITY ANALYST 2.00
16 LAWYER 5.00
occupation2 <- b %>%
select(contributor_occupation, contribution_receipt_amount) %>%
group_by(contributor_occupation) %>%
summarise(total = sum(contribution_receipt_amount)) %>%
arrange(desc(contributor_occupation))
occupation3 <- b %>%
select(contributor_occupation) %>%
count(contributor_occupation) %>%
group_by(contributor_occupation) %>%
arrange(desc(contributor_occupation))
final_occ <- cbind(occupation2, occupation3[, 2]) # remove duplicate column
occ_avg <- final_occ %>%
select(contributor_occupation:n) %>%
mutate("Average Donation" = total/n) %>%
rename("Number of Donations"= n, "Occupation" = contributor_occupation, "Total Donated" = total)
occ_avg %>%
arrange(desc(`Average Donation`))
This gives me the result I want but seems like a very cumbersome process. It seems I get the same result by using the following code; however, I am confused as to why it works:
avg_donation_occupation <- b %>%
select(contributor_occupation, contribution_receipt_amount) %>%
group_by(contributor_occupation) %>%
summarize(avg_donation_by_occupation = sum(contribution_receipt_amount)/n()) %>%
arrange(desc(avg_donation_by_occupation))
Wouldn't dividing by n divide by the number of rows (i.e., number of occupations) as opposed to the number of people in that occupation (which is what I used the count function for previously)?
Thanks for the help clearing up any confusion!
Upvotes: 1
Views: 60
Reputation: 887078
We may need both sum
and mean
along with n()
which gives the number of observations in the grouped data. According to ?context
n() gives the current group size.
and `?mean
mean - Generic function for the (trimmed) arithmetic mean.
which is basically the sum
of observations divided by the number of observations
library(dplyr)
out <- b %>%
group_by(Occupation = contributor_occupation) %>%
summarise(`Total Donated` = sum(contribution_receipt_amount),
`Number of Donations` = n(),
`Average Donation` = mean(contribution_receipt_amount),
#or
#`Average Donation` = `Total Donated`/`Number of Donations`,
.groups = 'drop') %>%
arrange(desc(`Average Donation`))
-output
out
# A tibble: 14 × 4
Occupation `Total Donated` `Number of Donations` `Average Donation`
<chr> <dbl> <int> <dbl>
1 CIRCUS ARTIST 127 2 63.5
2 ARTIST 50 1 50
3 STUDENT 50 1 50
4 NOT EMPLOYED 52.7 2 26.4
5 SOFTWARE ENGINEER 10 1 10
6 ENGINEER 6 1 6
7 CONTRACTOR 5 1 5
8 LAWYER 5 1 5
9 LISTING COORDINATOR 5 1 5
10 FARMER 2.7 1 2.7
11 TEACHER 2.7 1 2.7
12 TRUCK DRIVER 2.7 1 2.7
13 ELECTRICAL DESIGNER 2 1 2
14 INFORMATION SECURITY ANALYST 2 1 2
b <- structure(list(contributor_occupation = c("LISTING COORDINATOR",
"NOT EMPLOYED", "TEACHER", "ELECTRICAL DESIGNER", "STUDENT",
"SOFTWARE ENGINEER", "TRUCK DRIVER", "NOT EMPLOYED", "CONTRACTOR",
"ENGINEER", "FARMER", "ARTIST", "CIRCUS ARTIST", "CIRCUS ARTIST",
"INFORMATION SECURITY ANALYST", "LAWYER"), contribution_receipt_amount = c(5,
2.7, 2.7, 2, 50, 10, 2.7, 50, 5, 6, 2.7, 50, 100, 27, 2, 5)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16"))
Upvotes: 2