Why does the MATLAB signrank function return the same signed rank statistic when the signs of the data points are flipped?

Question

I have a sequence of data points stored in a vector x, and I use signrank(x) to run the signed rank test.

The MATLAB documentation says:

When you use the test for one sample, then W is the sum of the ranks of positive differences between the observations and the hypothesized median value M0 (which is 0 when you use signrank(x) and m when you use signrank(x,m)).

So I think signrank(x) and signrank(-x) should give different statistics. But in the examples I have tried, I get the same signed rank statistic for x and -x. How is the signed rank statistic defined in the MATLAB signrank function?

Thanks!

Stuart · Accepted Answer

Thanks! Actually the statistic is the minimum of the sum of the ranks of positive differences and the sum of the ranks of negative differences. I don't understand why it takes the minimum. Do you?

Interesting question, and thanks for the link to the MATLAB code. That had me scratching my head for a few minutes too; they certainly do it in a curly manner, presumably for computational efficiency. Surprisingly, however, it does actually compute the signed rank statistic, exactly as posted previously.

Here's how it works (I've pasted the relevant few lines of code below for reference).

Let me denote by P the sum of all positive ranks (ranks corresponding to positive scores), by N the sum of all negative ranks, and by A the sum of all ranks regardless of sign. Clearly A = P + N. (Note that what I've denoted "N" is the variable "w" in the actual code.)

By arithmetic series, A = n*(n+1)/2. So as you said, the line min(w, n*(n+1)/2-w) is actually returning either N or P (=A-N), whichever is smaller.

But now look at the last line of the code I pasted below. Since n*(n+1)/4 = A/2, the numerator is min(N,P) - A/2.

Now if N is the minimum this returns N-(P+N)/2, which equals -(P - N)/2.

However if P is the minimum this returns P-(P+N)/2, which equals -(N - P)/2.

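So in either case it really is returning the negative of half the absolute difference of the positive and negative rank sums, -|P - N|/2. The factor of one half is absorbed by the denominator (the standard deviation of a single rank sum is half that of P - N), so z is the same as for the statistic previously posted in the simplified form,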

| Sum{ sign(Xi) rank(|Xi|) } |
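The identity above is easy to check numerically. Here is a minimal sketch in base MATLAB, assuming a made-up sample x with no zeros and no tied absolute values (so plain sorting gives the ranks; the real signrank handles ties differently). The variable names are mine, not taken from the signrank source.

```matlab
% Hypothetical sample: no zeros and no tied absolute values (assumption).
x = [1.2 -0.4 3.1 -2.5 0.9 -0.1 2.2];
n = numel(x);

% Rank the absolute values (1 = smallest); valid only because there are no ties.
[~, order] = sort(abs(x));
r = zeros(1, n);
r(order) = 1:n;

P = sum(r(x > 0));    % sum of ranks of the positive values
N = sum(r(x < 0));    % sum of ranks of the negative values ("w" in the snippet)
A = n*(n+1)/2;        % sum of all ranks, so A = P + N

lhs = min(N, P) - A/2;                 % numerator as written in the snippet
rhs = -abs(P - N)/2;                   % minus half the absolute difference
signedSum = abs(sum(sign(x) .* r));    % | Sum{ sign(Xi) rank(|Xi|) } |
```

For this sample P = 19 and N = 9, so lhs and rhs are both -5 and signedSum is 10, i.e. lhs = -signedSum/2.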

BTW, the reason they use the negative of the absolute difference there is simply that it saves them from having to compute the complementary cdf later.

Snippet from the signrank code for reference:

w = sum(tierank(neg));
w = min(w, n*(n+1)/2-w);
...
z = (w-n*(n+1)/4) / sqrt((n*(n+1)*(2*n+1) - tieadj)/24);
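To see why flipping the signs of the data changes nothing, here is a rough sketch that mimics the two "w" lines in base MATLAB. The helpers absrank, negsum and wstat are hypothetical names of mine, the sample is made up, and I assume no zeros and no ties (so the tie adjustment tieadj is taken as 0). It is not the actual signrank implementation.

```matlab
% Ranks of |v| by pairwise comparison; valid only when there are no ties (assumption).
absrank = @(v) sum(bsxfun(@le, abs(v(:)), abs(v(:)).'), 1);

% Sum of the ranks belonging to negative values (the first "w" line).
negsum = @(v) sum(absrank(v) .* (v(:).' < 0));

% The smaller of the two rank sums (the second "w" line).
wstat = @(v) min(negsum(v), numel(v)*(numel(v)+1)/2 - negsum(v));

x = [1.2 -0.4 3.1 -2.5 0.9 -0.1 2.2];   % hypothetical sample
n = numel(x);

wNegOnly  = negsum(x);     % rank sum of the negatives of x
wNegFlip  = negsum(-x);    % rank sum of the negatives of -x (the positives of x)
w         = wstat(x);      % statistic for x
wFlipped  = wstat(-x);     % statistic for -x

% Normal approximation with tieadj = 0 (assumption: no ties).
z = (w - n*(n+1)/4) / sqrt(n*(n+1)*(2*n+1)/24);
```

Here negsum(x) is 9 and negsum(-x) is 19, but after the min both x and -x give w = 9, which is why the reported statistic does not depend on the sign of the data.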

Edit:

Why does it take absolute value? For z to have asymptotic normality, isn't it that there should be no absolute value taken?

My understanding is that it's not actually normal, it's "folded normal", that is, folded onto the positive half of the real line. That's why the p-value is calculated as,

p = 2*(1 - normcdf(z,0,1));

(Aside: I know that in the actual code they use the negative of "z" to avoid needing the complementary cdf there, but it's the same thing.)

The p value is multiplied by two to account for the folded distribution. Conveniently, this also works out exactly the same as calling it a "two tailed" p value.

Think for a moment about what would happen if we didn't use the absolute value here. Say we took P-N and N was greater than P. In this case the p value, 2*(1-normcdf(z,0,1)), would evaluate to greater than one, so that can't be a good idea. :)
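A quick numerical illustration of the last few paragraphs, as a sketch only. Phi is my own base-MATLAB stand-in for normcdf(z,0,1), written with erfc so it runs without the Statistics Toolbox, and zabs is a made-up value of the folded statistic.

```matlab
% Standard normal cdf via erfc; stands in for normcdf(z,0,1).
Phi = @(z) 0.5 * erfc(-z / sqrt(2));

zabs = 0.845;                        % hypothetical |z| (the folded statistic)

pFolded  = 2 * (1 - Phi(zabs));      % p-value from the positive, folded statistic
pNegated = 2 * Phi(-zabs);           % same p-value from the negated statistic, no complement needed
pUnfolded = 2 * (1 - Phi(-zabs));    % what the formula gives if z is allowed to go negative
```

pFolded and pNegated agree (up to rounding) by symmetry of the normal distribution, while pUnfolded comes out greater than one, which is the problem described above.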
