Nicolas De Jay

Reputation: 444

Linear time complexity ranking algorithm when the orders are precomputed

I am trying to write an efficient ranking algorithm in C++, but I will present my case in R because it is far easier to follow that way.

> samples_x <- c(4, 10, 9, 2, NA, 3, 7, 1, NA, 8)
> samples_y <- c(5, 7, 9, NA, 1, 4, NA, 8, 2, 10)
> orders_x <- order(samples_x)
> orders_y <- order(samples_y)
> cbind(samples_x, orders_x, samples_y, orders_y)
      samples_x orders_x samples_y orders_y
 [1,]         4        8         5        5
 [2,]        10        4         7        9
 [3,]         9        6         9        6
 [4,]         2        1        NA        1
 [5,]        NA        7         1        2
 [6,]         3       10         4        8
 [7,]         7        3        NA        3
 [8,]         1        2         8       10
 [9,]        NA        5         2        4
[10,]         8        9        10        7

Suppose the above is already precomputed. Performing a simple ranking on each of the sample sets then takes linear time (the result is much like that of the rank function):

> ranks_x <- rep(0, length(samples_x))
> for (i in 1:length(samples_x)) ranks_x[orders_x[i]] <- i
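Since the target language is C++, here is a minimal sketch of the same inversion there. The function name ranks_from_order is hypothetical, and it assumes 0-based indices in the order vector (unlike R's 1-based order) while still returning 1-based ranks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Invert a precomputed sort permutation into ranks in O(n).
// Assumption: `order` holds 0-based indices; ranks are returned 1-based
// so that they line up with the R loop above.
std::vector<std::size_t> ranks_from_order(const std::vector<std::size_t>& order) {
    std::vector<std::size_t> ranks(order.size(), 0);
    for (std::size_t i = 0; i < order.size(); ++i) {
        assert(order[i] < order.size());  // must be a valid permutation entry
        ranks[order[i]] = i + 1;          // the i-th smallest element gets rank i+1
    }
    return ranks;
}
```

As in the R version, missing values simply receive the last ranks, because order places them at the end.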

For a work project, it would be useful to emulate the following behaviour in linear time:

> cc <- complete.cases(samples_x, samples_y)
> ranks_x <- rank(samples_x[cc])
> ranks_y <- rank(samples_y[cc])

The complete.cases function, when given n sets of the same length, returns a logical vector that is TRUE at the positions where none of the sets contains an NA. The order function returns the permutation of indices that sorts the sample set. The rank function returns the ranks of the sample set.

How can I do this? Let me know if I have provided enough information about the problem.

More specifically, I am trying to build a correlation matrix based on Spearman's rank correlation coefficient in such a way that NAs are handled properly. With s sample sets of length n, the presence of NAs requires the rankings to be recomputed for every pair of sample sets (s^2 n log n); I am trying to avoid that by computing the orders once for every sample set (s n log n) and then doing only linear work for every pairwise comparison. Is this even doable?
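For context, once two rank vectors over the same complete cases are available, the coefficient itself is a linear-time computation. A minimal C++ sketch, assuming the ranks are stored as doubles with the NAs already removed; the name spearman_from_ranks is hypothetical, and the coefficient is computed as the Pearson correlation of the ranks:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Spearman's rho as the Pearson correlation of two rank vectors.
// Assumption: rx and ry have equal length and contain no missing values.
// Runs in O(n); returns NaN if either vector has zero variance.
double spearman_from_ranks(const std::vector<double>& rx,
                           const std::vector<double>& ry) {
    assert(rx.size() == ry.size());
    const std::size_t n = rx.size();
    double mean_x = 0.0, mean_y = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        mean_x += rx[i];
        mean_y += ry[i];
    }
    mean_x /= static_cast<double>(n);
    mean_y /= static_cast<double>(n);
    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double dx = rx[i] - mean_x;
        const double dy = ry[i] - mean_y;
        sxy += dx * dy;  // co-deviation of the ranks
        sxx += dx * dx;
        syy += dy * dy;
    }
    return sxy / std::sqrt(sxx * syy);
}
```

For the six complete pairs in the example above, the ranks are 3, 6, 5, 2, 1, 4 for x and 2, 3, 5, 1, 4, 6 for y, which gives a coefficient of 11/35, roughly 0.314.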

Thanks in advance.

Upvotes: 0

Views: 856

Answers (1)

mcdowella

Reputation: 19601

It looks like, when you work out the rank correlation of two arrays, you want to delete from both arrays elements in positions where either has NA.

You have

for (i in 1:length(samples_x)) ranks_x[orders_x[i]] <- i

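Could you change this to something like the following?

wp <- 0
for (i in 1:length(samples_x)) {
  j <- orders_x[i]
  # comparing with == NA never yields TRUE in R, so test with is.na()
  if (is.na(samples_x[j]) || is.na(samples_y[j])) {
    ranks_x[j] <- NA   # incomplete pair: no rank
  } else {
    wp <- wp + 1       # next rank among the complete pairs
    ranks_x[j] <- wp
  }
}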

Then you could either go along later and compress out the NAs, or hope the correlation subroutine just ignores them.
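A minimal C++ sketch of this idea, including the later pass that compresses out the NAs. It rests on several assumptions that are not in the question: missing values are encoded as NaN, the precomputed order is 0-based, there are no ties (R's rank averages tied values by default, which this does not do), and the name complete_case_ranks is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Rank x over the positions where neither x nor y is missing, reusing a
// precomputed sort order of x. Two linear passes, no sorting.
// Assumptions: NaN encodes a missing value; `order_x` is 0-based.
// Returns 1-based ranks in original position order, incomplete pairs dropped.
std::vector<double> complete_case_ranks(const std::vector<double>& x,
                                        const std::vector<double>& y,
                                        const std::vector<std::size_t>& order_x) {
    assert(x.size() == y.size() && x.size() == order_x.size());
    const std::size_t n = x.size();
    std::vector<double> ranks(n, std::nan(""));  // NaN marks "no rank"
    std::size_t count = 0;
    // Pass 1: walk x in sorted order, numbering only the complete pairs.
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t j = order_x[i];
        if (!std::isnan(x[j]) && !std::isnan(y[j])) {
            ranks[j] = static_cast<double>(++count);
        }
    }
    // Pass 2: compress out the positions that received no rank.
    std::vector<double> compact;
    compact.reserve(count);
    for (std::size_t j = 0; j < n; ++j) {
        if (!std::isnan(ranks[j])) compact.push_back(ranks[j]);
    }
    return compact;
}
```

Calling it once as complete_case_ranks(x, y, order_x) and once as complete_case_ranks(y, x, order_y) yields two aligned rank vectors, because both calls keep the same positions in the same order. On the question's data this should reproduce rank(samples_x[cc]) and rank(samples_y[cc]).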

Upvotes: 1
