Z_D

Reputation: 817

R: dplyr summarize, sum only values of uniques

I'm having trouble writing a summary calculation with the dplyr package. It's easiest to explain with some example data:

example_data <- structure(list(Date = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L), 
    Name = structure(c(3L, 3L, 4L, 3L, 2L, 3L, 2L, 4L, 1L), .Label = c("George", 
    "Jack", "John", "Mary"), class = "factor"), Birth.Year = c(1995L, 
    1995L, 1997L, 1995L, 1999L, 1995L, 1999L, 1997L, 1997L), 
    Special_Balance = c(10L, 40L, 30L, 5L, 10L, 15L, 2L, 1L, 
    100L), Total_Balance = c(100L, 100L, 50L, 200L, 20L, 200L, 
    20L, 100L, 1600L)), .Names = c("Date", "Name", "Birth.Year", 
"Special_Balance", "Total_Balance"), class = "data.frame", row.names = c(NA, 
-9L))

I have two simple summaries in mind. First, I'd like to summarize by Date alone, using the code below. The part that is wrong is the total_balance_sum calculation: I want to sum each person's Total_Balance, but only once per person. For instance, my command gives total_balance_sum=100 for Date=1, but it should be 150 (John's Total_Balance of 100 counted once, plus Mary's Total_Balance of 50 counted once). This wrong calculation then throws off the final total_pct calculation.

example_data %>% 
  group_by(Date) %>% 
  summarise(
    total_people=n_distinct(Name),
    total_loan_exposures=n(),

    special_sum=sum(Special_Balance,na.rm=TRUE),
    total_balance_sum=sum(Total_Balance[n_distinct(Name)]), 
    total_pct=special_sum/total_balance_sum

  ) -> example_summary

In the second summary (below), I group by both date and birth year, and again am calculating total_balance_sum incorrectly.

example_data %>% 
  group_by(Date,Birth.Year) %>% 
  summarise(
    total_people=n_distinct(Name),
    total_loan_exposures=n(),

    special_sum=sum(Special_Balance,na.rm=TRUE),
    total_balance_sum=sum(Total_Balance[n_distinct(Name)]), 
    total_pct=special_sum/total_balance_sum

  ) -> example_summary_birthyear

What is the correct way to achieve my goal? Clearly the n_distinct(Name) index I'm using picks out only a single value of Total_Balance instead of summing one value per name.

Thanks for your help.

Upvotes: 2

Views: 2425

Answers (2)

Andrew Taylor

Reputation: 3488

I'm a little unclear on what you're asking for, but does this do what you'd like? (This covers just the first example.)

example_data %>% 
  # Step 1: collapse to one row per person per date
  group_by(Date, Name) %>% 
  summarise(
    total_loan_exposures=n(),
    total_SpecialPerson=sum(Special_Balance,na.rm=TRUE),
    total_balance_sumPerson=Total_Balance[1]) %>% 
  ungroup() %>% 
  # Step 2: aggregate the per-person rows by date
  group_by(Date) %>% 
  summarise(
    total_people=n(),
    total_loan_exposures=sum(total_loan_exposures),
    special_sum=sum(total_SpecialPerson,na.rm=TRUE),
    total_balance_sum=sum(total_balance_sumPerson)) %>% 
  mutate(total_pct=(special_sum/total_balance_sum))-> example_summary

> example_summary
Source: local data frame [3 x 6]

  Date total_people total_loan_exposures special_sum total_balance_sum  total_pct
1    1            2                    3          80               150 0.53333333
2    2            2                    4          32               220 0.14545455
3    3            2                    2         101              1700 0.05941176
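As a possible alternative, the same totals can probably be reached in a single summarise() by indexing Total_Balance with !duplicated(Name), which keeps only the first row seen for each name inside a group. This is a minimal sketch rather than a definitive solution: it assumes each person's Total_Balance is identical on all of their rows within a date, the data frame is rebuilt with data.frame() purely for brevity, and the name example_summary_onepass is made up for illustration.

```r
library(dplyr)

# Example data from the question, rebuilt with data.frame() for brevity
example_data <- data.frame(
  Date = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  Name = c("John", "John", "Mary", "John", "Jack", "John", "Jack", "Mary", "George"),
  Birth.Year = c(1995, 1995, 1997, 1995, 1999, 1995, 1999, 1997, 1997),
  Special_Balance = c(10, 40, 30, 5, 10, 15, 2, 1, 100),
  Total_Balance = c(100, 100, 50, 200, 20, 200, 20, 100, 1600)
)

# example_summary_onepass is a hypothetical name used only in this sketch
example_summary_onepass <- example_data %>%
  group_by(Date) %>%
  summarise(
    total_people = n_distinct(Name),
    total_loan_exposures = n(),
    special_sum = sum(Special_Balance, na.rm = TRUE),
    # !duplicated(Name) is TRUE only for the first row of each name in the group,
    # so each person's Total_Balance is counted once
    total_balance_sum = sum(Total_Balance[!duplicated(Name)], na.rm = TRUE),
    total_pct = special_sum / total_balance_sum
  )

example_summary_onepass
```

By contrast, Total_Balance[n_distinct(Name)] in the question uses a single number as the index (2 for Date=1), so it picks out one element instead of one element per person.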

Upvotes: 2

jeremycg

Reputation: 24945

For the second example (for the first, just remove the Birth.Year):

library(dplyr)
example_data %>%
  group_by(Date, Birth.Year) %>%
  mutate(special_sum = sum(Special_Balance),
         total_loan_exposure = n()) %>%
  # .keep_all = TRUE retains the columns added above; recent dplyr versions drop them otherwise
  distinct(Name, Total_Balance, .keep_all = TRUE) %>%
  summarise(Total_balance_sum = sum(Total_Balance),
            special_sum = special_sum[1],
            total_people = n(),
            total_loan_exposure = total_loan_exposure[1],
            total_pct = special_sum / Total_balance_sum)
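Both approaches in this thread rely on a person's Total_Balance being the same on every row within a group, which is true for the example data but may not hold for real data. A quick check along the following lines could confirm that before trusting the sums. It is only a sketch: the names inconsistent and n_balances are invented for illustration, the data frame is rebuilt with data.frame() for brevity, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Example data from the question, rebuilt with data.frame() for brevity
example_data <- data.frame(
  Date = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  Name = c("John", "John", "Mary", "John", "Jack", "John", "Jack", "Mary", "George"),
  Birth.Year = c(1995, 1995, 1997, 1995, 1999, 1995, 1999, 1997, 1997),
  Special_Balance = c(10, 40, 30, 5, 10, 15, 2, 1, 100),
  Total_Balance = c(100, 100, 50, 200, 20, 200, 20, 100, 1600)
)

# Count how many different Total_Balance values each person has per date;
# any person/date pair with more than one would make "sum once per person" ambiguous
inconsistent <- example_data %>%
  group_by(Date, Name) %>%
  summarise(n_balances = n_distinct(Total_Balance), .groups = "drop") %>%
  filter(n_balances > 1)

# Zero rows means every person has a single Total_Balance within each date
nrow(inconsistent)
```

If the check returns any rows, you would need to decide which balance to keep for that person (first, last, maximum, and so on) before summing.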

Upvotes: 1
