Daniel Schultz
Daniel Schultz

Reputation: 320

Rowwise finding the lowest values of a matrix or a data frame (ordering)

Objective: Finding the lowest n values of each row from a matrix or data frame. For this example we want to find the 3 lowest values of each row. We want to return a matrix with

rowname | colname_min | value_min | colname_min2 | value_min2 | colname_min3 | value_min3

Point of departure: I modified the answer from this question: R getting the minimum value for each row in a matrix, and returning the row and column name

Here is my modified code:

    df<-data.frame(matrix(data=round(x=rnorm(100,10,1),digits=3),nrow=10),
               row.names=c("A","B","C","D","E","F","G","H","I","J"))
    colnames(df)<-c("AD","BD","CD","DD","ED","FD","GD","HD","ID","JD")

   result <- t(sapply(seq(nrow(df)), function(i) {
  j <- apply(df, 1, function(x){order(x, decreasing=F)[1:3]})
  c(rownames(df)[i], colnames(df)[j[1,i]], as.numeric(df[i,j[1,i]]),
    colnames(df)[j[2,i]], as.numeric(df[i,j[2,i]]),
    colnames(df)[j[3,i]], as.numeric(df[i,j[3,i]]))
}))

This is working, and it is working fine for the small example data.frame. However, the data.frame I am working with has 200,000 rows and 300 columns. On my machine the code now runs for ~1 hour and is still working. Any ideas how to optimize the code? I was thinking dplyr, but couldn't find a solution. Help is greatly appreciated.

Upvotes: 2

Views: 1044

Answers (1)

Marat Talipov
Marat Talipov

Reputation: 13304

You can use this base R solution, which orders each row and picks n.min lowest values and their indices:

Sample Data

N <- 5
n.min <- 2

set.seed(1)
m <- matrix(runif(N^2),N)
rownames(m) <- letters[1:N]
colnames(m) <- LETTERS[1:N]

#           A          B         C         D         E
# a 0.2655087 0.89838968 0.2059746 0.4976992 0.9347052
# b 0.3721239 0.94467527 0.1765568 0.7176185 0.2121425
# c 0.5728534 0.66079779 0.6870228 0.9919061 0.6516738
# d 0.9082078 0.62911404 0.3841037 0.3800352 0.1255551
# e 0.2016819 0.06178627 0.7698414 0.7774452 0.2672207

Code

f <- function(rw) {
  O <- order(rw)[1:n.min]
  rbind(O,rw[O])
}

result <- t(apply(m,1,f))

Output for the sample data

#   [,1]       [,2] [,3]      [,4]
# a    3 0.20597457    1 0.2655087
# b    3 0.17655675    5 0.2121425
# c    1 0.57285336    5 0.6516738
# d    5 0.12555510    4 0.3800352
# e    2 0.06178627    1 0.2016819

Update

If you'd like to replace column numbers by the column names, you could additionaly do:

d <- as.data.frame(result)
d[,c(T,F)] <- colnames(m)[unlist(d[,c(T,F)])]

Note that it this way, you avoid coercion of numbers to strings and the subsequent backward conversion to the numeric format, which might be costly for large data sets.

Upvotes: 3

Related Questions