Reputation: 85
I am trying to figure out how to apply a certain function only to the rows of a matrix with the same entry on the last column, but no luck until now.
My matrix (that we are going to call simply matrix
and suppose it is 5x4) looks like this:
d1.1 d1.2 d1.3 NAME1
d2.1 d2.2 d2.3 NAME1
d3.1 d3.2 d3.3 NAME2
d4.1 d4.2 d4.3 NAME3
d5.1 d5.2 d5.3 NAME2`
I want to perform the summary statistics fun1
on the rows with the same name, in order to get a final matrix that looks like this:
fun1(d1.1, d2.1) fun1(d1.2, d2.2) fun1(d1.3, d2.3) NAME1
fun1(d3.1, d5.1) fun1(d3.2, d5.2) fun1(d3.3, d5.3) NAME2
d4.1 d4.2 d4.3 NAME3.
It is also fine if fun1
is also performed on 'single' rows, i.e.
fun1(d1.1, d2.1) fun1(d1.2, d2.2) fun1(d1.3, d2.3) NAME1
fun1(d3.1, d5.1) fun1(d3.2, d5.2) fun1(d3.3, d5.3) NAME2
fun1(d4.1) fun1(d4.2) fun1(d4.3) NAME3.
I tried with
sapply(subset(matrix[,1:3], as.character(matrix[,4])==as.character(listofnames)), fun1)
but of course it does not work. The current problem is on the subsetting as.character(matrix[,4])==as.character(listofnames)
, since the two objects have different dimensions, but I am sure this is not the only one.
I tried to look for similar problems but I just found subsetting by specified (numerical) conditions (>3) or by patterns (every group of 7 ordered entries). No luck with factors or characters.
I guess there may be something helpful in the plyr
package, but I am not able to make it work. Any suggestion is greatly appreciated!
Update
In my case, fun1=min
. The problem has changed meanwhile: while keeping the data grouped by NAME
, I would like to get the min of, say, column 1 in each group and to save the whole row where the min is found, like this: suppose d1.1 < d2.1
and d5.1 < d3.1
, then the matrix
d1.1 d1.2 d1.3 NAME1
d2.1 d2.2 d2.3 NAME1
d3.1 d3.2 d3.3 NAME2
d4.1 d4.2 d4.3 NAME3
d5.1 d5.2 d5.3 NAME2
should become
d1.1 d1.2 d1.3 NAME1
d4.1 d4.2 d4.3 NAME3
d5.1 d5.2 d5.3 NAME2
without loss of the other columns.
I tried playing around with the mutate
and summarise
arguments as suggested, but keep getting warnings and errors (and actually I do not find the help()
very helpful at all).
Upvotes: 0
Views: 123
Reputation: 85
I think I made it!
library(dplyr)
df1 <- df %>%
group_by(NAMES) %>%
filter(df, X1 == min(X1))
Minimum returned, no data removed. I found a similar answer on another thread. It would have the problem that it returns all rows if multiple minima are present, but this is not my case.
Upvotes: 0
Reputation: 887118
You could try:
library(dplyr)
dfSelectSummary <- df %>%
group_by(name) %>%
summarise_each(funs(mean=mean(., na.rm=TRUE), sd=sd(., na.rm=TRUE),
median=stats::median(., na.rm=TRUE)), starts_with("X"))
dfSelectSummary[,1:4]
#Source: local data frame [3 x 4]
# name X1_mean X2_mean X3_mean
#1 NAME1 4.250 3.333333 4.888889
#2 NAME2 5.375 4.555556 6.000000
#3 NAME3 6.000 8.000000 9.000000
Or you could use data.table
library(data.table)
DT <- data.table(df, key='name')
nm1 <- colnames(DT[, as.list(summary(X1[!is.na(X1)])), by=name])[-1]
DTSummary <- DT[, c(Var=list(nm1),
lapply(.SD, function(x) summary(x[!is.na(x)]))), by=name]
head(DTSummary,8)
# name Var X1 X2 X3 X4 X5
#1: NAME1 Min. 1.00 0.000 0.000 3.000 0.00
#2: NAME1 1st Qu. 2.00 2.000 1.000 3.750 3.25
#3: NAME1 Median 3.50 3.000 6.000 7.500 5.00
#4: NAME1 Mean 4.25 3.333 4.889 6.375 5.00
#5: NAME1 3rd Qu. 6.00 5.000 8.000 8.250 7.25
#6: NAME1 Max. 10.00 7.000 10.000 9.000 10.00
#7: NAME2 Min. 0.00 0.000 0.000 1.000 1.00
#8: NAME2 1st Qu. 3.75 4.000 4.000 3.000 4.25
Another option would be to try summaryBy
from doBy
library(doBy)
summaryBy(.~name, data=df,
FUN=function(x) c(mean=mean(x, na.rm=TRUE), var= var(x, na.rm=TRUE),
median=median(x, na.rm=TRUE)))
If you have a numeric
name column, you may not need to convert the matrix
m1 <- as.matrix(cbind(name=as.numeric(df$name), df[,-1]))
by(m1[,-1], m1[,1], FUN=summary)
set.seed(45)
df <- data.frame(name=sample(paste0("NAME", 1:3),20, replace=TRUE),
matrix(sample(c(NA, 0:10), 20*5, replace=TRUE), ncol=5))
If you need the results in the long
form and would like to keep the comments
column, you could use mutate_each
df1 <- df %>%
group_by(name) %>%
mutate_each(funs(min=min(., na.rm=TRUE)), starts_with("X"))
colnames(df1)[2:6] <- paste0("Min", colnames(df1)[2:6])
head(df1,3)
#Source: local data frame [3 x 7]
#Groups: name
# name MinX1 MinX2 MinX3 MinX4 MinX5 Comments
#1 NAME2 0 0 0 1 1 Fair
#2 NAME1 1 0 0 3 0 Bad
#3 NAME1 1 0 0 3 0 Good
set.seed(45)
df <- data.frame(name=sample(paste0("NAME", 1:3),20, replace=TRUE),
matrix(sample(c(NA, 0:10), 20*5, replace=TRUE), ncol=5),
Comments=sample(c("Good", "Fair", "Bad", "ugly"), 20, replace=TRUE))
Upvotes: 2