DirkLX
DirkLX

Reputation: 1337

Calculate Mean in R on multiple columns for values greater X

I have the following data and would like to calculate the mean values for t1-t5, grouped by iid, for every cid.

  1. I only want to calculate the mean for values > 0
  2. Ideally I don't want name all the fields when doing mean e.g. mean(t1), mean(t2). This is because in my real case I have 200+ fields.

Sample data:

cid   iid   t1  t2  t3  t4  t5
4503    2   4   5   5   6   7
4503    2   7   5   3   2   8
841     2   9   2   1   5   7
2134    1   6   9   8   2   1

Here is my code so far. Can somebody please help me to finish it off. Thanks in advance.

library(dplyr)
test <- read.csv("~/Documents/R-SCRIPTS/DATA/test.csv", sep=";")

t <- test %>% 
  group_by(cid, iid) %>%
  select(t1:t5) %>%
  summarise(t1 = mean(t1, na.rm = TRUE), 
            t2 = mean(t2,na.rm = TRUE), 
            t3 = mean(t3,na.rm = TRUE), 
            t4 = mean(t4,na.rm = TRUE), 
            t5 = mean(t5,na.rm = TRUE) 
            ) 

Upvotes: 2

Views: 445

Answers (2)

talat
talat

Reputation: 70336

If I understand correctly, you could simply use:

test %>% 
  group_by(cid, iid) %>% 
  summarise_each(funs(mean(.[.>0], na.rm = TRUE)), t1:t5)
#Source: local data frame [3 x 7]
#Groups: cid [?]
#
#    cid   iid    t1    t2    t3    t4    t5
#  (int) (int) (dbl) (dbl) (dbl) (dbl) (dbl)
#1   841     2   9.0     2     1     5   7.0
#2  2134     1   6.0     9     8     2   1.0
#3  4503     2   5.5     5     4     4   7.5

Upvotes: 1

Conta
Conta

Reputation: 585

Is this what you want? I do not use other packages, but use the colMeans(). Here is an example:

The data looks like (a short copy of your example)

 > mydata
      iid t1 t2 t3
    1   2  4  5  5
    2   2  7  5  3
    3   2  9  2  1
    4   1  6  9  8

The code:

id_list <- unique(mydata$iid) # get the id
result <- matrix(nrow=0, ncol=4) # create a matrix to store result
colnames(result) <- colnames(mydata) # name the columns of the matrix
for (i in 1:length(id_list)){
   uid <- id_list[i]
   # for each id, calculate the column averages
   average <- unname(colMeans(mydata[mydata$iid==uid,2:4])) 
   # write to the result
   result <- rbind(result, c(uid, average))
}
result

The results looks like:

    > result
     iid       t1 t2 t3
[1,]   2 6.666667  4  3
[2,]   1 6.000000  9  8

For your problem, you need to change the colMeans(mydata[mydata$iid==uid,2:4]) to colMeans(mydata[mydata$iid==uid,2:201]), which are your column index you want the mean. And change ncol in matrix(nrow=0, ncol=4) corresponding the result data you want.

For values < 0, you can first covert the negative values into NA, mydata[,2:4][mydata[,2:4]<0]<-NA. Then add na.rm=TRUE in colMeans().

Update the same example:

> mydata
  iid t1 t2 t3
1   2  4  5  5
2   2 -2  5  3
3   2  9  2  1
4   1  6  9 -1

Code:

mydata[,2:4][mydata[,2:4]<0]<-NA
id_list <- unique(mydata$iid)
result <- matrix(nrow=0, ncol=4)
colnames(result) <- colnames(mydata)
for (i in 1:length(id_list)){
   uid <- id_list[i]
   average <- unname(colMeans(mydata[mydata$iid==uid,2:4], na.rm=TRUE))
   result <- rbind(result, c(uid, average))
}
result

Result:

> result
     iid  t1 t2  t3
[1,]   2 6.5  4   3
[2,]   1 6.0  9 NaN

Upvotes: 0

Related Questions