Reputation: 444
I am trying to write an efficient ranking algorithm in C++ but I will present my case in R as it is far easier to understand this way.
> samples_x <- c(4, 10, 9, 2, NA, 3, 7, 1, NA, 8)
> samples_y <- c(5, 7, 9, NA, 1, 4, NA, 8, 2, 10)
> orders_x <- order(samples_x)
> orders_y <- order(samples_y)
> cbind(samples_x, orders_x, samples_y, orders_y)
      samples_x orders_x samples_y orders_y
 [1,]         4        8         5        5
 [2,]        10        4         7        9
 [3,]         9        6         9        6
 [4,]         2        1        NA        1
 [5,]        NA        7         1        2
 [6,]         3       10         4        8
 [7,]         7        3        NA        3
 [8,]         1        2         8       10
 [9,]        NA        5         2        4
[10,]         8        9        10        7
Suppose the above is already precomputed. A simple ranking of each sample set then takes linear time (the result is much like that of the rank function):
> ranks_x <- rep(0, length(samples_x))
> for (i in 1:length(samples_x)) ranks_x[orders_x[i]] <- i
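Since the target is C++, the loop above might translate to something like the following. This is only a sketch: the function name ranks_from_order is made up for illustration, and it assumes the precomputed order is stored as 0-based indices (R's order is 1-based) while the returned ranks stay 1-based.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Invert a precomputed ordering into ranks in O(n).
// Assumption: order[i] is the 0-based position of the i-th smallest sample.
// The returned ranks are 1-based, as in the R loop.
std::vector<std::size_t> ranks_from_order(const std::vector<std::size_t>& order) {
    std::vector<std::size_t> ranks(order.size(), 0);
    for (std::size_t i = 0; i < order.size(); ++i) {
        assert(order[i] < order.size());  // every entry must be a valid index
        ranks[order[i]] = i + 1;
    }
    return ranks;
}
```

As in the R version, positions holding NA simply receive the last ranks, because the ordering places them at the end.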
For a project at work, it would be useful to emulate the following behaviour in linear time:
> cc <- complete.cases(samples_x, samples_y)
> ranks_x <- rank(samples_x[cc])
> ranks_y <- rank(samples_y[cc])
The complete.cases function, when given n sets of the same length, returns a logical vector marking the positions at which none of the sets contains an NA. The order function returns the permutation of indices that sorts the sample set. The rank function returns the ranks of the sample set.
How can I do this? Let me know if more information about the problem is needed.
More specifically, I am trying to build a correlation matrix based on Spearman's rank correlation coefficient in a way that handles NAs properly. The presence of NAs requires that the rankings be calculated for every pairwise sample set (s n^2 log n); I am trying to avoid that by calculating the orders once for every sample set (s n log n) and then using a linear-time step for every pairwise comparison. Is this even doable?
Thanks in advance.
Upvotes: 0
Views: 856
Reputation: 19601
It looks like, when you work out the rank correlation of two arrays, you want to delete from both arrays elements in positions where either has NA.
You have
for (i in 1:length(samples_x)) ranks_x[orders_x[i]] <- i
Could you change this to something like
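wp <- 0  # number of complete pairs ranked so far
for (i in 1:length(samples_x)) {
  j <- orders_x[i]
  # NA never compares equal to anything, so test with is.na() rather than ==
  if (is.na(samples_x[j]) || is.na(samples_y[j])) {
    ranks_x[j] <- NA
  } else {
    # R has no ++ operator; increment first so that ranks start at 1, as rank() does
    wp <- wp + 1
    ranks_x[j] <- wp
  }
}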
Then you could either go along later and compress out the NAs, or hope the correlation subroutine just ignores them.
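For the C++ side, the same idea could be sketched roughly as below. Everything here is an assumption made for illustration rather than something from the question: NA is encoded as NaN, the precomputed orders are 0-based, a rank of 0 marks a position dropped for this pair, and the names pairwise_ranks and spearman_from_ranks are made up. The correlation uses the shortcut formula 1 - 6 * sum(d^2) / (m * (m^2 - 1)), which is only valid when there are no ties; tied values would get distinct ranks in whatever order the precomputed ordering lists them, unlike R's rank, which averages them by default.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Rank the entries of x that are complete with respect to y, in O(n).
// Assumptions: NA is encoded as NaN and order_x holds 0-based indices
// sorted by x. Returns 1-based ranks; dropped positions get rank 0.
std::vector<std::size_t> pairwise_ranks(const std::vector<double>& x,
                                        const std::vector<double>& y,
                                        const std::vector<std::size_t>& order_x) {
    assert(x.size() == y.size() && x.size() == order_x.size());
    std::vector<std::size_t> ranks(x.size(), 0);
    std::size_t next_rank = 0;
    for (std::size_t i = 0; i < order_x.size(); ++i) {
        const std::size_t j = order_x[i];
        if (std::isnan(x[j]) || std::isnan(y[j])) continue;  // incomplete pair
        ranks[j] = ++next_rank;
    }
    return ranks;
}

// Spearman's rho from two rank vectors produced by pairwise_ranks for the
// same pair, skipping positions with rank 0. Assumes no ties.
// Returns NaN when fewer than two complete pairs remain.
double spearman_from_ranks(const std::vector<std::size_t>& rx,
                           const std::vector<std::size_t>& ry) {
    assert(rx.size() == ry.size());
    double sum_d2 = 0.0;
    std::size_t m = 0;  // number of complete pairs
    for (std::size_t i = 0; i < rx.size(); ++i) {
        if (rx[i] == 0) continue;  // dropped position (ry[i] is 0 as well)
        const double d = static_cast<double>(rx[i]) - static_cast<double>(ry[i]);
        sum_d2 += d * d;
        ++m;
    }
    if (m < 2) return std::nan("");
    const double md = static_cast<double>(m);
    return 1.0 - 6.0 * sum_d2 / (md * (md * md - 1.0));
}
```

Calling pairwise_ranks(x, y, order_x) and pairwise_ranks(y, x, order_y) gives the two rank vectors for one pair, so each pairwise comparison stays linear once the orders have been computed.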
Upvotes: 1