Al. So
Al. So

Reputation: 53

Calculating average based on column values

enter image description here

I have a dataset that has a variable ColumnStart that identifies the first column to calculate an average. I have a second variable ColumnEnd that identifies the last column in that calculation. For the first row I would like to calculate the average of the values from column 5 to column 9. the second row from column 6 to 11, etc.

The output would be:

enter image description here

Here is the updated dput from R:

structure(list(ID = c("AAA", "BBB", "CCC", "DDD"), ShortID = c("452L", 
"3L", "4L", "324L"), Name = c("PS1", "PS2", "PS3", "PS4"), Route = 
c("Internal", 
"External", "Internal", "Internal"), ColumnStart = c(7L, 7L, 
9L, 8L), ColumnEnd = c(9L, 11L, 13L, 10L), Date1 = c(1L, 5L, 
13L, 4L), Date2 = c(2L, 6L, 45L, 3L), Date3 = c(3L, 7L, 23L, 
2L), Date4 = c(4L, 8L, 65L, 1L), Date5 = c(5L, 8L, 34L, 3L), 
Date6 = c(6L, 9L, 23L, 5L), Date7 = c(7L, 6L, 54L, 6L), Date8 = c(7L, 
6L, 1L, 7L), Date9 = c(8L, 9L, 3L, 8L)), .Names = c("ID", 
"ShortID", "Name", "Route", "ColumnStart", "ColumnEnd", "Date1", 
"Date2", "Date3", "Date4", "Date5", "Date6", "Date7", "Date8", 
"Date9"), row.names = c(NA, -4L), class = c("tbl_df", "tbl", 
"data.frame"), spec = structure(list(cols = structure(list(ID = 
structure(list(), class = c("collector_character", 
"collector")), ShortID = structure(list(), class = 
c("collector_character", 
"collector")), Name = structure(list(), class = c("collector_character", 
"collector")), Route = structure(list(), class = c("collector_character", 
"collector")), ColumnStart = structure(list(), class = 
c("collector_integer", 
"collector")), ColumnEnd = structure(list(), class = 
c("collector_integer", 
"collector")), Date1 = structure(list(), class = c("collector_integer", 
"collector")), Date2 = structure(list(), class = c("collector_integer", 
"collector")), Date3 = structure(list(), class = c("collector_integer", 
"collector")), Date4 = structure(list(), class = c("collector_integer", 
"collector")), Date5 = structure(list(), class = c("collector_integer", 
"collector")), Date6 = structure(list(), class = c("collector_integer", 
"collector")), Date7 = structure(list(), class = c("collector_integer", 
"collector")), Date8 = structure(list(), class = c("collector_integer", 
"collector")), Date9 = structure(list(), class = c("collector_integer", 
"collector"))), .Names = c("ID", "ShortID", "Name", "Route", 
"ColumnStart", "ColumnEnd", "Date1", "Date2", "Date3", "Date4", 
"Date5", "Date6", "Date7", "Date8", "Date9")), default = structure(list(), 
class = c("collector_guess", 
"collector"))), .Names = c("cols", "default"), class = "col_spec"))

Upvotes: 2

Views: 166

Answers (2)

divibisan
divibisan

Reputation: 12165

Here's a base R solution that drops non-numeric columns before calculating the mean:

df$ave2 <- apply(df, 1, function(x) {
    y <- as.numeric(x[seq.int(x['ColumnStart'], x['ColumnEnd'])])
    mean(y[!is.na(y)])
    })

df
# A tibble: 4 x 16
  ID    ShortID Name  Route    ColumnStart ColumnEnd Date1 Date2 Date3 Date4 Date5 Date6 Date7 Date8 Date9 Average
  <chr> <chr>   <chr> <chr>          <int>     <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>   <dbl>
1 AAA   452L    PS1   Internal           7         9     1     2     3     4     5     6     7     7     8     2  
2 BBB   3L      PS2   External           7        11     5     6     7     8     8     9     6     6     9     6.8
3 CCC   4L      PS3   Internal           9        13    13    45    23    65    34    23    54     1     3    39.8
4 DDD   324L    PS4   Internal           8        10     4     3     2     1     3     5     6     7     8     2  

as.numeric tries to convert the values to numeric. If it cannot, then it returns NA. We then drop NA values and calculate the mean.


Here's a one line version that works the same but uses na.omit to strip out NA values before calculating the mean:

df$Average <- apply(df, 1, function(x) mean(na.omit(as.numeric(x[seq.int(x['ColumnStart'], x['ColumnEnd'])]))))

Upvotes: 3

IceCreamToucan
IceCreamToucan

Reputation: 28705

Another method, not necessarily suggested

rowMeans(df*NA^!(col(df) >= df$ColumnStart & col(df) <= df$ColumnEnd), 
         na.rm = T)
# [1] 3.000000 7.142857 5.000000 3.333333 6.500000

Explanation:

col(df) >= df$ColumnStart & col(df) <= df$ColumnEnd is a matrix which is TRUE at the (i, j) indices matching the ColumnStart, ColumnEnd specification

NA^!(col(df) >= df$ColumnStart & col(df) <= df$ColumnEnd) is a matrix which is 1 at the places where the above matrix is TRUE and NA everywhere else. Mutiplying this by df gives a matrix which is the same as df except all elements whose indices don't match the ColumnStart and ColumnEnd specification are NA

Now we can just take the rowMeans of that, with na.rm = T to get the desired result

Upvotes: 1

Related Questions