EVie

Reputation: 23

Average of Columns by Unique ID in R

I would like to average columns in a data set based on a unique identifier. I do not know ahead of time how many columns I will have for each unique identifier or what order they will come in. The unique IDs are all known beforehand and correspond to weeks. I have found solutions for regularly patterned columns, but not for using the actual column headers to work out which columns to average together. Thanks for any and all help.

I present the original data and the desired result. In this example there are only two unique IDs (wk1 and wk2).

x = read.table(text = "
  site     wk1  wk2  wk1  wk1
    1       2    4    6    8
    2      10   20   30   40
    3       5   NA    2    3
    4     100  100   NA   NA", 
sep = "", header = TRUE)

x

desired.outcome = read.table(text = "
  site    wk1avg  wk2avg
    1      5.3      4
    2      26.6    20
    3      3.3     NA
    4       NA     100", 
sep = "", header = TRUE)

Upvotes: 1

Views: 306

Answers (2)

alistaire

Reputation: 43344

Here's a tidyr and dplyr approach:

library(dplyr)
library(tidyr)

x %>% gather(wk, val, -site) %>%           # gather wk* columns into key-value pairs
    extract(wk, 'wk', '(wk\\d+).*?') %>%       # trim suffixes added by read.table
    group_by(site, wk) %>% 
    summarise(mean_val = mean(val)) %>%    # calculate grouped means
    spread(wk, mean_val)                   # spread back into wk* columns
# Source: local data frame [4 x 3]
# Groups: site [4]
# 
#    site       wk1   wk2
#   (int)     (dbl) (dbl)
# 1     1  5.333333     4
# 2     2 26.666667    20
# 3     3  3.333333    NA
# 4     4        NA   100
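
As a side note, on newer versions of the packages (tidyr >= 1.0.0 and dplyr >= 1.0.0) the same pipeline can be sketched with pivot_longer()/pivot_wider(), which supersede gather()/spread(); the suffix cleanup and the grouped mean are unchanged:

library(dplyr)
library(tidyr)

x %>% 
    pivot_longer(-site, names_to = "wk", values_to = "val") %>%  # long format, one row per site/week value
    mutate(wk = sub("\\..*", "", wk)) %>%                        # drop the .1/.2 suffixes from read.table
    group_by(site, wk) %>% 
    summarise(mean_val = mean(val), .groups = "drop") %>%        # grouped means; NAs propagate as above
    pivot_wider(names_from = wk, values_from = mean_val)         # back to one column per week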

Upvotes: 1

eipi10

Reputation: 93851

If your original data file has duplicated column names, read.table will change them so that all the columns have unique names (as you can see by checking x in your example after it's loaded). In fact, the code below depends on that happening, because melt will drop columns with duplicated names. We then use mutate to strip the suffixes that read.table added when de-duplicating the column names, so that we can group properly by week.
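
To see that renaming, inspect the column names of x after loading; by default read.table runs the header through make.names(unique = TRUE), which appends .1, .2, ... to duplicates:

names(x)
# [1] "site"  "wk1"   "wk2"   "wk1.1" "wk1.2"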

library(reshape2)
library(dplyr)

x %>% melt(id.var="site") %>%   # Convert to long format
  mutate(variable = gsub("\\..*", "", variable)) %>%  # "re-duplicate" original column names
  group_by(site, variable) %>%     
  summarise(mn = mean(value)) %>%
  dcast(site ~ variable)

  site       wk1 wk2
1    1  5.333333   4
2    2 26.666667  20
3    3  3.333333  NA
4    4        NA 100

Upvotes: 3
