Reputation: 85

R: subsetting with respect to a vector

I am trying to figure out how to apply a certain function only to the rows of a matrix with the same entry on the last column, but no luck until now.

My matrix (that we are going to call simply matrix and suppose it is 5x4) looks like this:

d1.1   d1.2   d1.3   NAME1 
d2.1   d2.2   d2.3   NAME1 
d3.1   d3.2   d3.3   NAME2 
d4.1   d4.2   d4.3   NAME3
d5.1   d5.2   d5.3   NAME2`

I want to perform the summary statistics fun1 on the rows with the same name, in order to get a final matrix that looks like this:

fun1(d1.1, d2.1)   fun1(d1.2, d2.2)   fun1(d1.3, d2.3)   NAME1
fun1(d3.1, d5.1)   fun1(d3.2, d5.2)   fun1(d3.3, d5.3)   NAME2
d4.1               d4.2               d4.3               NAME3.

It is also fine if fun1 is also performed on 'single' rows, i.e.

fun1(d1.1, d2.1)   fun1(d1.2, d2.2)   fun1(d1.3, d2.3)   NAME1
fun1(d3.1, d5.1)   fun1(d3.2, d5.2)   fun1(d3.3, d5.3)   NAME2
fun1(d4.1)         fun1(d4.2)         fun1(d4.3)         NAME3.

I tried with

sapply(subset(matrix[,1:3], as.character(matrix[,4])==as.character(listofnames)), fun1)

but of course it does not work. The current problem is on the subsetting as.character(matrix[,4])==as.character(listofnames), since the two objects have different dimensions, but I am sure this is not the only one.

I tried to look for similar problems but I just found subsetting by specified (numerical) conditions (>3) or by patterns (every group of 7 ordered entries). No luck with factors or characters.

I guess there may be something helpful in the plyr package, but I am not able to make it work. Any suggestion is greatly appreciated!

Update

In my case, fun1=min. The problem has changed meanwhile: while keeping the data grouped by NAME, I would like to get the min of, say, column 1 in each group and to save the whole row where the min is found, like this: suppose d1.1 < d2.1 and d5.1 < d3.1, then the matrix

d1.1   d1.2   d1.3   NAME1 
d2.1   d2.2   d2.3   NAME1 
d3.1   d3.2   d3.3   NAME2 
d4.1   d4.2   d4.3   NAME3
d5.1   d5.2   d5.3   NAME2

should become

d1.1   d1.2   d1.3   NAME1 
d4.1   d4.2   d4.3   NAME3
d5.1   d5.2   d5.3   NAME2

without loss of the other columns. I tried playing around with the mutate and summarise arguments as suggested, but keep getting warnings and errors (and actually I do not find the help() very helpful at all).

Upvotes: 0

Answers (2)

jeiroje

Reputation: 85

I think I made it!

library(dplyr)

df1 <- df %>%
       group_by(NAMES) %>%
       filter(df, X1 == min(X1))

Minimum returned, no data removed. I found a similar answer on another thread. It would have the problem that it returns all rows if multiple minima are present, but this is not my case.

Upvotes: 0

akrun

Reputation: 887118

You could try:

library(dplyr)
dfSelectSummary <- df %>% 
              group_by(name) %>% 
             summarise_each(funs(mean=mean(., na.rm=TRUE), sd=sd(., na.rm=TRUE),
             median=stats::median(., na.rm=TRUE)), starts_with("X"))

dfSelectSummary[,1:4]
#Source: local data frame [3 x 4]

#   name X1_mean  X2_mean  X3_mean
#1 NAME1   4.250 3.333333 4.888889
#2 NAME2   5.375 4.555556 6.000000
#3 NAME3   6.000 8.000000 9.000000

Or you could use data.table

library(data.table)
DT <- data.table(df, key='name')
nm1 <- colnames(DT[, as.list(summary(X1[!is.na(X1)])), by=name])[-1]
DTSummary <- DT[,  c(Var=list(nm1),
    lapply(.SD, function(x) summary(x[!is.na(x)]))), by=name]

head(DTSummary,8)
#    name     Var    X1    X2     X3    X4    X5
#1: NAME1    Min.  1.00 0.000  0.000 3.000  0.00
#2: NAME1 1st Qu.  2.00 2.000  1.000 3.750  3.25
#3: NAME1  Median  3.50 3.000  6.000 7.500  5.00
#4: NAME1    Mean  4.25 3.333  4.889 6.375  5.00
#5: NAME1 3rd Qu.  6.00 5.000  8.000 8.250  7.25
#6: NAME1    Max. 10.00 7.000 10.000 9.000 10.00
#7: NAME2    Min.  0.00 0.000  0.000 1.000  1.00
#8: NAME2 1st Qu.  3.75 4.000  4.000 3.000  4.25

Another option would be to try summaryBy from doBy

library(doBy)
 summaryBy(.~name, data=df,
    FUN=function(x) c(mean=mean(x, na.rm=TRUE), var= var(x, na.rm=TRUE),
                    median=median(x, na.rm=TRUE)))

If you have a numeric name column, you may not need to convert the matrix

 m1 <- as.matrix(cbind(name=as.numeric(df$name), df[,-1]))
 by(m1[,-1], m1[,1], FUN=summary)

data

set.seed(45)
df <- data.frame(name=sample(paste0("NAME", 1:3),20, replace=TRUE),
        matrix(sample(c(NA, 0:10), 20*5, replace=TRUE), ncol=5))

Update

If you need the results in the long form and would like to keep the comments column, you could use mutate_each

 df1 <- df %>% 
           group_by(name) %>% 
           mutate_each(funs(min=min(., na.rm=TRUE)), starts_with("X"))

 colnames(df1)[2:6] <- paste0("Min", colnames(df1)[2:6])
 head(df1,3)
 #Source: local data frame [3 x 7]
 #Groups: name

 #   name MinX1 MinX2 MinX3 MinX4 MinX5 Comments
 #1 NAME2     0     0     0     1     1     Fair
 #2 NAME1     1     0     0     3     0      Bad
 #3 NAME1     1     0     0     3     0     Good

newdata

  set.seed(45)
  df <- data.frame(name=sample(paste0("NAME", 1:3),20, replace=TRUE),
          matrix(sample(c(NA, 0:10), 20*5, replace=TRUE), ncol=5), 
             Comments=sample(c("Good", "Fair", "Bad", "ugly"), 20, replace=TRUE))

Upvotes: 2

R: subsetting with respect to a vector

Answers (2)

data

Update

newdata

Related Questions