krlmlr
krlmlr

Reputation: 25454

Column-wise sort or top-n of matrix

I need to sort a matrix so that all elements stay in their columns and each column is in ascending order. Is there a vectorized column-wise sort for a matrix or a data frame in R? (My matrix is all-positive and bounded by B, so I can add j*B to each cell in column j and do a regular one-dimensional sort:

> set.seed(100523); m <- matrix(round(runif(30),2), nrow=6); m
     [,1] [,2] [,3] [,4] [,5]
[1,] 0.47 0.32 0.29 0.54 0.38
[2,] 0.38 0.91 0.76 0.43 0.92
[3,] 0.71 0.32 0.48 0.16 0.85
[4,] 0.88 0.83 0.61 0.95 0.72
[5,] 0.16 0.57 0.70 0.82 0.05
[6,] 0.77 0.03 0.75 0.26 0.05
> offset <- rep(seq_len(5), rep(6, 5)); offset
 [1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5
> m <- matrix(sort(m + offset), nrow=nrow(m)) - offset; m
     [,1] [,2] [,3] [,4] [,5]
[1,] 0.16 0.03 0.29 0.16 0.05
[2,] 0.38 0.32 0.48 0.26 0.05
[3,] 0.47 0.32 0.61 0.43 0.38
[4,] 0.71 0.57 0.70 0.54 0.72
[5,] 0.77 0.83 0.75 0.82 0.85
[6,] 0.88 0.91 0.76 0.95 0.92

But is there something more beautiful already included?) Otherwise, what would be the fastest way if my matrix has around 1M (10M, 100M) entries (roughly a square matrix)? I'm worried about the performance penalty of apply and friends.

Actually, I don't need "sort", just "top n", with n being around 30 or 100, say. I am thinking about using apply and the partial parameter of sort, but I wonder if this is cheaper than just doing a vectorized sort. So, before doing benchmarks on my own, I'd like to ask for advice by experienced users.

Upvotes: 4

Views: 1755

Answers (5)

krlmlr
krlmlr

Reputation: 25454

I have put down a quick testing framework for the solutions proposed so far.

library(rbenchmark)

sort.q <- function(m) {
  sort(m, method='quick')
}
sort.p <- function(m) {
  mm <- sort(m, partial=TOP)[1:TOP]
  sort(mm)
}

sort.all.g <- function(f) {
  function(m) {
    o <- matrix(rep(seq_len(SIZE), rep(SIZE, SIZE)), nrow=SIZE)
    matrix(f(m+o), nrow=SIZE)[1:TOP,]-o[1:TOP,]
  }
}
sort.all <- sort.all.g(sort)
sort.all.q <- sort.all.g(sort.q)

apply.sort.g <- function(f) {
  function(m) {
    apply(m, 2, f)[1:TOP,]
  }
}
apply.sort <- apply.sort.g(sort)
apply.sort.p <- apply.sort.g(sort.p)
apply.sort.q <- apply.sort.g(sort.q)

bb <- NULL

SIZE_LIMITS <- 3:9
TOP_LIMITS <- 2:5

for (SIZE in floor(sqrt(10)^SIZE_LIMITS)) {
  for (TOP in floor(sqrt(10)^TOP_LIMITS)) {
    print(c(SIZE, TOP))
    TOP <- min(TOP, SIZE)
    m <- matrix(runif(SIZE*SIZE), floor(SIZE))
    if (SIZE < 1000) {
      mr <- apply.sort(m)
      stopifnot(apply.sort.q(m) == mr)
      stopifnot(apply.sort.p(m) == mr)
      stopifnot(sort.all(m) == mr)
      stopifnot(sort.all.q(m) == mr)
    }

    b <- benchmark(apply.sort(m),
                   apply.sort.q(m),
                   apply.sort.p(m),
                   sort.all(m),
                   sort.all.q(m),
                   columns= c("test", "elapsed", "relative",
                              "user.self", "sys.self"),
                   replications=1,
                   order=NULL)
    b$SIZE <- SIZE
    b$TOP <- TOP
    b$test <- factor(x=b$test, levels=b$test)

    bb <- rbind(bb, b)
  }
}

ftable(xtabs(user.self ~ SIZE+test+TOP, bb))

The results so far indicate that for all but the biggest matrices, apply really hurts performance unless doing a "top n". For "small" matrices < 1e6, just sorting the whole thing without apply is competitive. For "huge" matrices, sorting the whole array becomes slower than apply. Using partial works best for "huge" matrices and is only a slight loss for "small" matrices.

Please feel free to add your own sorting routine :-)

                      TOP      10      31     100     316
SIZE  test                                               
31    apply.sort(m)         0.004   0.012   0.000   0.000
      apply.sort.q(m)       0.008   0.016   0.000   0.000
      apply.sort.p(m)       0.008   0.020   0.000   0.000
      sort.all(m)           0.000   0.008   0.000   0.000
      sort.all.q(m)         0.000   0.004   0.000   0.000
100   apply.sort(m)         0.012   0.016   0.028   0.000
      apply.sort.q(m)       0.016   0.016   0.036   0.000
      apply.sort.p(m)       0.020   0.020   0.040   0.000
      sort.all(m)           0.000   0.004   0.008   0.000
      sort.all.q(m)         0.004   0.004   0.004   0.000
316   apply.sort(m)         0.060   0.060   0.056   0.060
      apply.sort.q(m)       0.064   0.060   0.060   0.072
      apply.sort.p(m)       0.064   0.068   0.108   0.076
      sort.all(m)           0.016   0.016   0.020   0.024
      sort.all.q(m)         0.020   0.016   0.024   0.024
1000  apply.sort(m)         0.356   0.276   0.276   0.292
      apply.sort.q(m)       0.348   0.316   0.288   0.296
      apply.sort.p(m)       0.256   0.264   0.276   0.320
      sort.all(m)           0.268   0.244   0.213   0.244
      sort.all.q(m)         0.260   0.232   0.200   0.208
3162  apply.sort(m)         1.997   1.948   2.012   2.108
      apply.sort.q(m)       1.916   1.880   1.892   1.901
      apply.sort.p(m)       1.300   1.316   1.376   1.544
      sort.all(m)           2.424   2.452   2.432   2.480
      sort.all.q(m)         2.188   2.184   2.265   2.244
10000 apply.sort(m)        18.193  18.466  18.781  18.965
      apply.sort.q(m)      15.837  15.861  15.977  16.313
      apply.sort.p(m)       9.005   9.108   9.304   9.925
      sort.all(m)          26.030  25.710  25.722  26.686
      sort.all.q(m)        23.341  23.645  24.010  24.073
31622 apply.sort(m)       201.265 197.568 196.181 196.104
      apply.sort.q(m)     163.190 160.810 158.757 160.050
      apply.sort.p(m)      82.337  81.305  80.641  82.490
      sort.all(m)         296.239 288.810 289.303 288.954
      sort.all.q(m)       260.872 249.984 254.867 252.087

Upvotes: 4

Tim P
Tim P

Reputation: 1383

They say there's a fine line between genius and madness... take a look at this and see what you think of the idea. As in the question, the goal is to find the top 30 elements of a vector vec that might be long (1e7, 1e8, or more elements).

topn = 30
sdmult = max(1,qnorm(1-(topn/length(vec))))
sdmin = 1e-5
acceptmult = 10
calcsd = max(sd(vec),sdmin)
calcmn = mean(vec)
thresh = calcmn + sdmult*calcsd
subs = which(vec > thresh)
while (length(subs) > topn * acceptmult) {
    thresh = thresh + calcsd
    subs = which(vec > thresh)
}
while (length(subs) < topn) {
    thresh = thresh - calcsd
    subs = which(vec > thresh)
}
topvals = sort(vec[subs],dec=TRUE)[1:topn]

The basic idea is that even if we don't know much about the distribution of vec, we'd certainly expect the highest values in vec to be several standard deviations above the mean. If vec were normally distributed, then the qnorm expression on line 2 gives a rough idea how many sd's above the mean we'd need to look to find the highest topn values (e.g. if vec contains 1e8 values, the top 30 values are likely to be located in the region starting 5 sd's above the mean.) Even if vec isn't normal, this assumption is unlikely to be massively far away from the truth.

Ok, so we compute the mean and sd of vec, and use these to propose a threshold to look above - a certain number of sd's above the mean. We're hoping to find in this upper tail a subset of slightly more than topn values. If we do, we can sort it and easily identify the highest topn values - which will be the highest topn values in vec overall.

Now the exact rules here can probably be tweaked a bit, but the idea is that we need to guard against the original threshold being "out" for some reason. We therefore exploit the fact that it's quick to check how many elements lie above a certain threshold. So, we first raise the threshold, in increments of calcsd, until there are fewer than 10 * topn elements above the threshold. Then, if needed. we reduce thresh (again in steps of calcsd) until we definitely have at least topn elements above the threshold. This bi-directional search should always lead to a "threshold set" whose size is fairly close to topn (hopefully within a factor of 10 or 100). As topn is relatively small (typical value 30), it will be really fast to sort this threshold set, which of course immediately gives us the highest topn elements in the original vector vec.

My claim is that the calculations involved in generating a decent threshold set are all quick in R, so if only the top 30 or so elements of a very large vector are required, this indirect approach will beat any approach that involves sorting the whole vector.

What do you think?! If you think it's an interesting idea, please like/vote up :) I'll look at doing some proper timings but my initial tests on randomly generated data were really promising - it'd be great to test it out on "real" data though...!

Cheers :)

Upvotes: 1

Andrie
Andrie

Reputation: 179428

R is very fast at matrix calculations. A matrix with 1e7 elements in 1e4 columns gets sorted in under 3 seconds on my machine

set.seed(1)
m <- matrix(runif(1e7), ncol=1e4)

system.time(sm <- apply(m, 2, sort))
   user  system elapsed 
   2.62    0.14    2.79 

The first 5 columns:

sm[1:15, 1:5]
              [,1]         [,2]         [,3]         [,4]         [,5]
 [1,] 2.607703e-05 0.0002085913 9.364448e-05 0.0001937598 1.157424e-05
 [2,] 9.228056e-05 0.0003156713 4.948019e-04 0.0002542199 2.126186e-04
 [3,] 1.607228e-04 0.0003988042 5.015987e-04 0.0004544661 5.855639e-04
 [4,] 5.756689e-04 0.0004399747 5.762535e-04 0.0004621083 5.877446e-04
 [5,] 6.932740e-04 0.0004676797 5.784736e-04 0.0004749235 6.470268e-04
 [6,] 7.856274e-04 0.0005927107 8.244428e-04 0.0005443178 6.498618e-04
 [7,] 8.489799e-04 0.0006210336 9.249109e-04 0.0005917936 6.548134e-04
 [8,] 1.001975e-03 0.0006522120 9.424880e-04 0.0007702231 6.569310e-04
 [9,] 1.042956e-03 0.0007237203 1.101990e-03 0.0009826915 6.810103e-04
[10,] 1.246256e-03 0.0007968422 1.117999e-03 0.0009873926 6.888523e-04
[11,] 1.337960e-03 0.0009294956 1.229132e-03 0.0009997757 8.671272e-04
[12,] 1.372295e-03 0.0012221676 1.329478e-03 0.0010375632 8.806398e-04
[13,] 1.583430e-03 0.0012781983 1.433513e-03 0.0010662393 8.886999e-04
[14,] 1.603961e-03 0.0013518191 1.458616e-03 0.0012068383 8.903167e-04
[15,] 1.673268e-03 0.0013697683 1.590524e-03 0.0013617468 1.024081e-03

Upvotes: 3

Tim P
Tim P

Reputation: 1383

Does

apply(m, 2, sort)

do the job? :)

Or for top-10, say, use:

apply(m, 2 ,function(x) {sort(x,dec=TRUE)[1:10]})

Performance is strong - for 1e7 rows and 5 cols (5e7 numbers in total), my computer took around 9 or 10 seconds.

Upvotes: 3

Gavin Simpson
Gavin Simpson

Reputation: 174813

If you want to use sort, ?sort indicates that method = "quick" can be twice as fast as the default method with on the order of 1 million elements.

Start with apply(m, 2, sort, method = "quick") and see if that provides sufficient speed.

Do note the comments on this in ?sort though; ties are sorted in a non-stable manner.

Upvotes: 4

Related Questions