Averaging different rows of a data frame based on a shared column value

Question

I have a large data frame that looks like this

iid Q1  Q2  Q16 Q17 Q18 pop a   b   c
Sample1 0   0   0   0   0   First   A   br1 Alpha1
Sample2 0   0   0   0   0   First   A   br1 Alpha1
Sample3 0   0   0   0   0   Second  B   br1 Beta
Sample4 0   0   0   0   0   Second  B   br1 Beta
Sample5 0   0   0   0   0   Third   C   br2 Gamma
Sample6 0   0   0   0   0   Fourth  C   br2 Delta
Sample7 0   0   0   0   0   First   A   br1 Alpha2
Sample8 0   0   0   0   0   First   A   br1 Alpha2

Those 0s are stand-ins for numbers between 0 and 1, which add up to 1. For a table this size, I could easily do it by hand, but my actual data frame has #### samples (iid) from ### populations (pop) with 18 Q columns. The samples are not sorted by population order (though I could sort it if it would simplify the coding).

For each population I am looking to get the average value for all 18 Q columns, so I can get an output like this:

Q1  Q2  Q16 Q17 Q18 a   b   c
0   0   0   0   0   First   A   br1
0   0   0   0   0   Fourth  C   br2
0   0   0   0   0   Second  B   br1
0   0   0   0   0   Third   C   br2

as well as output like this (when sorting by c instead)

Q1  Q2  Q16 Q17 Q18 pop a   b   c
0   0   0   0   0   First   A   br1 Alpha1
0   0   0   0   0   First   A   br1 Alpha2
0   0   0   0   0   Second  B   br1 Beta
0   0   0   0   0   Fourth  C   br2 Delta
0   0   0   0   0   Third   C   br2 Gamma

I was wondering how could I get R to do something like this?

akrun · Accepted Answer

We can use dplyr. We specify the grouping columns in the group_by and get the mean of the 'Q' columns with summarise_each. In matches we can use regex patterns to match the column names. In this case I am matching column names that start with 'Q' (^Q) followed by numbers (\d+).

library(dplyr)
df1 %>% 
      group_by(pop, a, b) %>% 
      summarise_each(funs(mean), matches('^Q\d+'))
#     pop      a         b Q1 Q2 Q16 Q17 Q18
#1  First      A       br1  0  0   0   0   0
#2 Fourth      C       br2  0  0   0   0   0
#3 Second      B       br1  0  0   0   0   0
#4  Third      C       br2  0  0   0   0   0

For the second case, we need to add the 'c' also in the group_by

df1 %>%
   group_by(pop, a, b, c) %>%
   summarise_each(funs(mean), matches('^Q\d+'))
#     pop      a         b          c Q1 Q2 Q16 Q17 Q18
#1  First      A       br1     Alpha1  0  0   0   0   0
#2  First      A       br1     Alpha2  0  0   0   0   0
#3 Fourth      C       br2      Delta  0  0   0   0   0
#4 Second      B       br1       Beta  0  0   0   0   0
#5  Third      C       br2      Gamma  0  0   0   0   0

Averaging different rows of a data frame based on a shared column value

Answers (1)

Related Questions