Al14
Al14

Reputation: 1814

Column names and column order

My data.frame:

df
ID Time  a  b  c  d  e
WT    A 28 56 50 60 15
WT    B 54 77 11 67 34
WT    C 53  8 87 62 55
WT    D 30 73 47 82  1
KO    A 24 83 14 17 36
KO    B 91 83 72 41  4
KO    C 79 17 76 21 54
KO    D 41 40 77 49 92

I subset and average groups df:

use_col=3:ncol(df)
mymean<-aggregate(df[, use_col],by= list(df$ID, df$Time),FUN = function(X)mean(X,na.rm=T))

Group.1 Group.2  a  b  c  d  e
  WT       A    51 52 49 29 47
  KO       A     8 17 78 64 96
  WT       B    79  5 45 83 56
  KO       B    53 47 37 99 17
  WT       C    72 38 56 63 40
  KO       C    51  3 30 47 52
  WT       D     3 30 75 53 73
  KO       D    13 72 16 52 13

Why do I loose the first two columns names and how could I keep them? Also, I try to factor mymean data.frame by:

mymean$Group.1=factor(mymean$Group.1, c("WT","KO"))

but it does not work. Thank you for the help.

Actually the mean should not have Time and should be like that:

mymean<-aggregate(df[, use_col],by= list(df$ID, ),FUN = function(X)mean(X,na.rm=T))

However the output is:

ID     a     b     c     d    e
KO 75.75 44.25 61.75 52.50 39.0
WT 56.00 57.00 84.25 58.75 39.5

But it should be the other way round like this:

ID     a     b     c     d    e
WT 56.00 57.00 84.25 58.75 39.5   
KO 75.75 44.25 61.75 52.50 39.0

Upvotes: 0

Views: 92

Answers (2)

akrun
akrun

Reputation: 887711

In aggregate, another option is the formula method, where we can specify all the non-grouping columns with . on the LHS of ~ and the grouping column on the RHS. As we don't need the 'Time' column in the mean, we can use a subset of the dataset, get the mean, specify the additional arguments na.rm=TRUE, na.action=NULL (it is to avoid removing an entire row if there is an NA element) and order the output based on the 'ID' column.

  res <-  aggregate(.~ID, subset(df, select=-Time),
                       FUN=mean, na.rm=TRUE, na.action=NULL)
  res[order(res$ID, decreasing=TRUE),]
  #  ID     a     b     c     d     e
  #2 WT 41.25 53.50 48.75 67.75 26.25
  #1 KO 58.75 55.75 59.75 32.00 46.50

We could also use summarise_each from dplyr as another option, where we group by 'ID' column and get the mean of all other columns except the the 'Time' and order the output using ID.

library(dplyr)
df %>% 
     group_by(ID) %>% 
     summarise_each(funs(mean=mean(., na.rm=TRUE)), -Time)  %>%
      arrange(desc(ID))
#  ID     a     b     c     d     e
#1 WT 41.25 53.50 48.75 67.75 26.25
#2 KO 58.75 55.75 59.75 32.00 46.50

Upvotes: 1

Rich Scriven
Rich Scriven

Reputation: 99361

In your aggregate() call, change the by argument to have named list elements

by = list(ID = df$ID, Time = df$Time)

Then for your updated question, you could use

use_col = 3:ncol(df)
mymean <- aggregate(df[, use_col], by = list(ID = df$ID), mean, na.rm=TRUE)
mymean[order(mymean$ID, decreasing = TRUE), ]
#   ID     a     b     c     d     e
# 2 WT 41.25 53.50 48.75 67.75 26.25
# 1 KO 58.75 55.75 59.75 32.00 46.50

Presumably these values are different from yours because you are using a different data set.

You could also do this with less pain with data.table

library(data.table)
## convert to data table
dt <- as.data.table(df)
## order by decreasing ID
setorderv(dt, "ID", -1L)
## remove the Time column then find the mean of all columns by ID
dt[, lapply(.SD, mean, na.rm = TRUE), by = ID, .SDcols = use_col]
#    ID     a     b     c     d     e
# 1: WT 41.25 53.50 48.75 67.75 26.25
# 2: KO 58.75 55.75 59.75 32.00 46.50

Upvotes: 2

Related Questions