Reputation: 817
I'm having trouble with one calculation in a summary I'm building with the dplyr package. It's easiest to explain with some example data:
example_data <- structure(list(Date = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L),
Name = structure(c(3L, 3L, 4L, 3L, 2L, 3L, 2L, 4L, 1L), .Label = c("George",
"Jack", "John", "Mary"), class = "factor"), Birth.Year = c(1995L,
1995L, 1997L, 1995L, 1999L, 1995L, 1999L, 1997L, 1997L),
Special_Balance = c(10L, 40L, 30L, 5L, 10L, 15L, 2L, 1L,
100L), Total_Balance = c(100L, 100L, 50L, 200L, 20L, 200L,
20L, 100L, 1600L)), .Names = c("Date", "Name", "Birth.Year",
"Special_Balance", "Total_Balance"), class = "data.frame", row.names = c(NA,
-9L))
I want two simple summaries. The first groups by Date only, using the code below. The part that is wrong is the total_balance_sum calculation: I want to sum the Total_Balance of each person, but count each person only once. For instance, for Date=1 my command gives total_balance_sum=100, but it should be 150 (John's Total_Balance of 100 counted once, plus Mary's Total_Balance of 50 counted once). This wrong value then throws off the final total_pct calculation.
example_data %>%
  group_by(Date) %>%
  summarise(
    total_people=n_distinct(Name),
    total_loan_exposures=n(),
    special_sum=sum(Special_Balance,na.rm=TRUE),
    total_balance_sum=sum(Total_Balance[n_distinct(Name)]),
    total_pct=special_sum/total_balance_sum
  ) -> example_summary
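To make the symptom concrete, here is a minimal base-R sketch of what that expression evaluates to for Date=1. The three values are typed in by hand from the example data, and `length(unique(Name))` stands in for `n_distinct(Name)` so the snippet runs without dplyr; the `!duplicated(Name)` line is only one possible way to count each person once, and it assumes a person's Total_Balance is the same on every row within the group.

```r
# Date = 1 rows, typed in by hand from the example data
Name <- c("John", "John", "Mary")
Total_Balance <- c(100L, 100L, 50L)

# Number of distinct names: a single number, 2
k <- length(unique(Name))

# Indexing with that number selects only the 2nd element
picked <- Total_Balance[k]

# Keeping the first row per name, then summing, counts each person once
wanted <- sum(Total_Balance[!duplicated(Name)])
```

Here `picked` is 100 (the value reported above) and `wanted` is 150 (the value I am after).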
In the second summary (below), I group by both Date and Birth.Year, and total_balance_sum is again calculated incorrectly.
example_data %>%
  group_by(Date,Birth.Year) %>%
  summarise(
    total_people=n_distinct(Name),
    total_loan_exposures=n(),
    special_sum=sum(Special_Balance,na.rm=TRUE),
    total_balance_sum=sum(Total_Balance[n_distinct(Name)]),
    total_pct=special_sum/total_balance_sum
  ) -> example_summary_birthyear
What is the correct way to achieve my goal? Clearly, using n_distinct(Name) as an index picks out only one of the values instead of summing one value per name.
Thanks for your help.
Upvotes: 2
Views: 2425
Reputation: 3488
I'm not entirely sure what you're asking for, but does this do what you'd like? (This is just for the first example.)
example_data %>%
  group_by(Date, Name) %>%
  summarise(
    total_loan_exposures=n(),
    total_SpecialPerson=sum(Special_Balance,na.rm=TRUE),
    total_balance_sumPerson=Total_Balance[1]) %>%
  ungroup() %>%
  group_by(Date) %>%
  summarise(
    total_people=n(),
    total_loan_exposures=sum(total_loan_exposures),
    special_sum=sum(total_SpecialPerson,na.rm=TRUE),
    total_balance_sum=sum(total_balance_sumPerson)) %>%
  mutate(total_pct=(special_sum/total_balance_sum)) -> example_summary
> example_summary
Source: local data frame [3 x 6]
  Date total_people total_loan_exposures special_sum total_balance_sum  total_pct
1    1            2                    3          80               150 0.53333333
2    2            2                    4          32               220 0.14545455
3    3            2                    2         101              1700 0.05941176
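If you would rather keep a single summarise() call, one possible variant (a sketch, not tested beyond this example data) is to sum only the first Total_Balance row per Name inside each group. It assumes a person's Total_Balance is identical on all of their rows within a Date, which holds for the data in the question; the data frame is retyped here with Name as a character column so the snippet is self-contained.

```r
library(dplyr)

# Example data from the question, retyped by hand (Name as character)
example_data <- data.frame(
  Date = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L),
  Name = c("John", "John", "Mary", "John", "Jack",
           "John", "Jack", "Mary", "George"),
  Birth.Year = c(1995L, 1995L, 1997L, 1995L, 1999L,
                 1995L, 1999L, 1997L, 1997L),
  Special_Balance = c(10L, 40L, 30L, 5L, 10L, 15L, 2L, 1L, 100L),
  Total_Balance = c(100L, 100L, 50L, 200L, 20L, 200L, 20L, 100L, 1600L)
)

example_summary <- example_data %>%
  group_by(Date) %>%
  summarise(
    total_people = n_distinct(Name),
    total_loan_exposures = n(),
    special_sum = sum(Special_Balance, na.rm = TRUE),
    # !duplicated(Name) is TRUE only for the first row of each person
    # within the current group, so each balance is added once
    total_balance_sum = sum(Total_Balance[!duplicated(Name)]),
    total_pct = special_sum / total_balance_sum
  )
```

Switching to group_by(Date, Birth.Year) should give the second summary in the same way, since duplicated() is evaluated within each group.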
Upvotes: 2
Reputation: 24945
For the second example (for the first, just remove the Birth.Year):
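library(dplyr)
example_data %>%
  group_by(Date, Birth.Year) %>%
  # per-group totals, computed before duplicate rows are dropped
  mutate(special_sum = sum(Special_Balance),
         total_loan_exposure = n()) %>%
  # keep one row per person; .keep_all = TRUE retains the columns added
  # above, which current dplyr versions would otherwise drop
  distinct(Name, Total_Balance, .keep_all = TRUE) %>%
  summarise(Total_balance_sum = sum(Total_Balance),
            special_sum = special_sum[1],
            total_people = n(),
            total_loan_exposure = total_loan_exposure[1],
            total_pct = special_sum / Total_balance_sum)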
Upvotes: 1