Philipp Chapkovski
Philipp Chapkovski

Reputation: 2059

how to make cross-row calculations in dplyr?

there is a data frame that contains nested info. Let' say number of pupils, number of pupils in class A and number of pupils in class B in each school. So that pupils = n.pupilsA + n.pupilsB + other_pupils

a <- data.frame(
  city = c(rep('New York',3), rep('Washington',3)),
  n = c(5, 2, 1, 5, 2, 1),
  name = c(
    'pupils',
    'classA',
    'classB',
    'pupils',
    'classA',
    'classB'
  )
)

output:

        city n   name
1   New York 5 pupils
2   New York 2 classA
3   New York 1 classB
4 Washington 5 pupils
5 Washington 2 classA
6 Washington 1 classB

Is there a smart way (using dplyr, presumably) to make a group operation that will add into each group 'other' that will be difference between 'pupils' and 'pupils - class A' + 'pupils - class B'. So the result would be something like:

        city   type npupils
1   New York classA       2
2   New York classB       1
3   New York pupils       5
4   New York  other       2
5 Washington classA       2
6 Washington classB       1
7 Washington pupils       5
8 Washington  other       2

The only way I thought might work was to spread it, calculate difference between columns, and gather it back using tidyr:

a %>%
  spread(name, n) %>%
  mutate(other = pupils - classA - classB) %>%
  gather(type, npupils, c('classA', 'classB', 'pupils', 'other')) %>%
  arrange(city)

which works, but I wonder if there is nicer way?

Upvotes: 1

Views: 113

Answers (1)

Ronak Shah
Ronak Shah

Reputation: 388962

We can create a summarized dataframe and bind it to original one. For each city we calculate n by subtracting the value of n where name == 'pupils' by remaining values in the group and create a name column as "Other" and add these rows to the original dataframe using bind_rows.

library(dplyr)

bind_rows(a, a %>%
              group_by(city)%>%
              summarise(n = n[name == 'pupils'] - sum(n[name != 'pupils']), 
                       name = "Other")) %>%
arrange(city)


#        city n   name
#1   New York 5 pupils
#2   New York 2 classA
#3   New York 1 classB
#4   New York 2  Other
#5 Washington 5 pupils
#6 Washington 2 classA
#7 Washington 1 classB
#8 Washington 2  Other

Note - Here I am assuming you have only one entry of "pupils" for each city or else we can use which.max to get the first entry.

Upvotes: 5

Related Questions