Amy
Amy

Reputation: 113

Combining & totalling rows in R

I have the below dataset, with the variables as follows:

member_id - an id number for each member

year - the year in question

gender - binary variable, 0 is male, 1 is female

party - the party of the member

Leadership - TRUE if the member holds a leadership position in government or opposition, FALSE if they don't

house_start - the date the member became an MP

Year.Entered - the year they became an MP

Years.in.parliament - how many years it has been since they were first elected

Edu - the amount of time the MP has participated in debates related to education in the given year.

  member_id year gender                   party Leadership house_start Year.Entered Years.in.parliament Edu
1       386 1997      0            Conservative      FALSE  03/05/1979         1979                 18    7
2        37 1997      0                  Labour      FALSE  03/05/1979         1979                 18   10
3        47 1997      0                  Labour       TRUE  09/06/1983         1983                 14  157
4       408 1997      0            Conservative       TRUE  03/05/1979         1979                 18   48
5        15 1997      1        Liberal Democrat      FALSE  09/06/1983         1983                 14    3
6        15 1997      1        Liberal Democrat       TRUE  09/06/1983         1983                 14    9
  

As you can see with rows 5 and 6 in the dataset, the same member is recorded twice in the one year. This has happened throughout the dataset for some members because of the Leadership variable. For example this member (id number 15) did not have a leadership position for the first part of 1997 but did get one later in the year. I want to be able to combine these two rows and have the Leadership variable as TRUE in these cases. I also need to compute the sum of Edu rows for these as well, so for this member it would become 12 (because I want each members number of times participated per year for this policy area). So I want it to look like:

  member_id year gender                   party Leadership house_start Year.Entered Years.in.parliament Edu
1       386 1997      0            Conservative      FALSE  03/05/1979         1979                 18    7
2        37 1997      0                  Labour      FALSE  03/05/1979         1979                 18   10
3        47 1997      0                  Labour       TRUE  09/06/1983         1983                 14  157
4       408 1997      0            Conservative       TRUE  03/05/1979         1979                 18   48
5        15 1997      1        Liberal Democrat       TRUE  09/06/1983         1983                 14   12
  

I have been trying to change these manually on Excel, but I need to do this for several different policy areas, so it is taking a lot of time. Any help would be much appreciated!

Upvotes: 1

Views: 67

Answers (3)

hello_friend
hello_friend

Reputation: 5788

From my understanding the minimal repeating group is the member_id & year, we can then sum the Edu amount defensively (using na.rm = TRUE) and then slice the grouped data.frame using boolean algebra (taking the maximum of a boolean vector yields true records).

library(dplyr)
df %>% 
  group_by(member_id, year) %>% 
  mutate(Edu = sum(Edu, na.rm = TRUE)) %>% 
  slice(which.max(Leadership)) %>% 
  ungroup()

Alternatively we can use top_n function (which yields the same result):

df %>% 
  group_by(member_id, year) %>% 
  mutate(Edu = sum(Edu, na.rm = TRUE)) %>% 
  top_n(1, Leadership) %>% 
  ungroup()

Upvotes: 0

Ronak Shah
Ronak Shah

Reputation: 388907

For each group you can select the rows where there is only one row or row where Leadership is TRUE.

library(dplyr)

df %>%
  group_by(member_id, year, gender, party) %>%
  mutate(Edu = sum(Edu)) %>%
  filter(n() == 1 | Leadership)

Upvotes: 0

akrun
akrun

Reputation: 887048

We can do a group by sum and arrange and slice the first row

library(dplyr)
df1 %>%
     group_by(member_id, year, gender, party) %>%
     mutate(Edu = sum(Edu)) %>% 
     arrange(party, desc(Leadership)) %>% 
     slice(1)

Upvotes: 3

Related Questions