Reputation: 45
I am trying to compute the Spearman rank correlation coefficient between two ranked lists, once with scipy.stats.spearmanr and once manually using the formula

ρ = 1 − (6 Σ dᵢ²) / (n(n² − 1))

However, I am getting significantly different results.
I have two lists representing the top 10 similar users identified using two different methods:
Method 1 Rankings:
User ID: [301, 597, 414, 477, 57, 369, 206, 535, 590, 418]
Rank: [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Method 2 Rankings:
User ID: [301, 477, 19, 120, 75, 57, 597, 160, 577, 369]
Rank: [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
I then applied Spearman’s rank correlation using scipy.stats.spearmanr:
from scipy.stats import spearmanr
method1 = [301, 597, 414, 477, 57, 369, 206, 535, 590, 418]
method2 = [301, 477, 19, 120, 75, 57, 597, 160, 577, 369]
coef, _ = spearmanr(method1, method2)
print(f"Spearman coefficient: {coef}")
Spearman coefficient: 0.2727
Manual Calculation: To verify, I computed ρ manually using the common users between both lists.
User ID | Rank (Method 1) | Rank (Method 2) | (d_i = R_1 - R_2) | (d_i^2) |
---|---|---|---|---|
301 | 1 | 1 | 0 | 0 |
597 | 2 | 7 | -5 | 25 |
477 | 4 | 2 | 2 | 4 |
57 | 5 | 6 | -1 | 1 |
369 | 6 | 10 | -4 | 16 |
We have only 5 common users, so n = 5.
Using the formula, I got ρ = 1 − (6 · 46) / (5 · (5² − 1)) = −1.3, which is outside the valid range [−1, 1] for Spearman's coefficient. However, scipy.stats.spearmanr() gives 0.2727, which is within range but significantly different.
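For reference, here is the arithmetic of my manual calculation, using the d² column from the table above:

```python
# Squared rank differences of the 5 common users (from the table); sum = 46
d_squared = [0, 25, 4, 1, 16]
n = 5
rho = 1 - 6 * sum(d_squared) / (n * (n**2 - 1))
print(rho)   # ≈ -1.3, outside [-1, 1]
```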
Why is my manual calculation incorrect? Am I making a mistake in how I handle missing users or the ranking?
How does scipy.stats.spearmanr handle missing values or mismatched lists internally? Does it rank only common elements, or does it consider all elements?
Where is my manual approach incorrect, and how does scipy.stats.spearmanr internally process the rankings?
Upvotes: 0
Views: 68
Reputation: 94
The formula used in the manual calculation is only valid if there are no ties in the ranks. See, for example, the "Definition and calculation" section of https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient
In any case, your ranks puzzle me. This is what scipy does implicitly, and it agrees with the 0.2727 answer:
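To illustrate the ties issue with a small example of my own (not the question's data): when a list contains a tie, the shortcut formula and spearmanr, which computes the Pearson correlation of the (average-for-ties) ranks, disagree slightly:

```python
from scipy.stats import spearmanr, rankdata

x = [1, 2, 2, 4]   # the tied 2s get average rank 2.5
y = [1, 3, 2, 4]
n = len(x)

# Shortcut formula applied to the (tied) ranks
rx, ry = rankdata(x), rankdata(y)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
print(1 - 6 * d2 / (n * (n**2 - 1)))   # 0.95

# scipy computes the Pearson correlation of the ranks instead
coef, _ = spearmanr(x, y)
print(coef)   # ≈ 0.9487
```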
from scipy.stats import spearmanr, rankdata

method1 = [301, 597, 414, 477, 57, 369, 206, 535, 590, 418]
method2 = [301, 477, 19, 120, 75, 57, 597, 160, 577, 369]

coef, _ = spearmanr(method1, method2)
print(f"Spearman coefficient: {coef}")

# spearmanr ranks the raw values (the user IDs) within each full list of 10
rank1 = rankdata(method1)
rank2 = rankdata(method2)
print(rank1)
print(rank2)

# Apply the formula with n = 10, so n*(n^2 - 1) = 10*99 = 990
d2 = (rank1 - rank2)**2
rho = 1 - 6*sum(d2)/(10*99)
print(rho)
Upvotes: 0
Reputation: 3873
Why is my manual calculation incorrect?
That formula expects all rank numbers to be present (e.g. each list with five elements would include integers 1-5), but your lists are skipping rank numbers.
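To make that concrete, a quick sketch using the five common users' ranks from the question: plugging the gappy ranks straight into the formula gives the out-of-range −1.3, while first re-ranking them to the consecutive integers the formula assumes gives a valid coefficient:

```python
from scipy.stats import rankdata

r1 = [1, 2, 4, 5, 6]    # Method 1 ranks of the 5 common users (3 is skipped)
r2 = [1, 7, 2, 6, 10]   # Method 2 ranks (3, 4, 5, 8, 9 are skipped)
n = len(r1)

# Formula applied to the raw, gappy ranks -> out of range
d2_raw = sum((a - b) ** 2 for a, b in zip(r1, r2))
rho_raw = 1 - 6 * d2_raw / (n * (n**2 - 1))
print(rho_raw)     # ≈ -1.3

# Re-rank to consecutive integers 1..5 first -> valid coefficient
x, y = rankdata(r1), rankdata(r2)
d2_fixed = sum((a - b) ** 2 for a, b in zip(x, y))
rho_fixed = 1 - 6 * d2_fixed / (n * (n**2 - 1))
print(rho_fixed)   # 0.7
```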
How does scipy.stats.spearmanr handle missing values or mismatched lists internally? Does it rank only common elements, or does it consider all elements?
It looks like you're expecting spearmanr to interpret the data you're providing as two rankings of users, matched up by user ID. It doesn't do that; it treats the two lists as paired numeric observations, ranks the values within each list, and correlates those ranks.
Your code is actually testing a null hypothesis of monotonicity between data represented in the following plot:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
method1 = [301, 597, 414, 477, 57, 369, 206, 535, 590, 418]
method2 = [301, 477, 19, 120, 75, 57, 597, 160, 577, 369]
plt.plot(method1, method2, '*')
plt.xlabel("User ID in list 1")
plt.ylabel("Corresponding user ID in list 2")
plt.title("Correlation between paired user IDs?")
stats.spearmanr(method1, method2)
# SignificanceResult(statistic=0.2727272727272727, pvalue=0.44583834154275137)
The numbers you're giving it are user IDs, so they could just as well be names or colors. You can order user IDs because they happen to be represented with numbers, and the correlation calculation happens to produce a result, but the number it produces is meaningless.
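One way to see this (my own illustration): swap two ID labels consistently in both lists. The underlying rankings are unchanged, yet the coefficient changes, because only the numeric order of the IDs matters to spearmanr:

```python
from scipy.stats import spearmanr

method1 = [301, 597, 414, 477, 57, 369, 206, 535, 590, 418]
method2 = [301, 477, 19, 120, 75, 57, 597, 160, 577, 369]

# Relabel: users 57 and 597 trade ID numbers in both lists.
# The rankings are identical; only the arbitrary labels changed.
swap = {57: 597, 597: 57}
m1 = [swap.get(u, u) for u in method1]
m2 = [swap.get(u, u) for u in method2]

coef_old, _ = spearmanr(method1, method2)
coef_new, _ = spearmanr(m1, m2)
print(coef_old)   # ≈ 0.2727
print(coef_new)   # ≈ -0.0788 -- a different "correlation" for the same rankings
```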
The simplest thing you can do to perform a reasonable test is to calculate the Spearman correlation coefficient between the ranks:
method1 = [1, 2, 4, 5, 6]
method2 = [1, 7, 2, 6, 10]
plt.plot(method1, method2, '*')
plt.xlabel("Rank according to method 1")
plt.ylabel("Rank according to method 2")
plt.title("Correlation between ranks according to two methods")
stats.spearmanr(method1, method2)
# SignificanceResult(statistic=0.7, pvalue=0.1881204043741873)
This calculates the correlation coefficient something like:
x = stats.rankdata(method1)
y = stats.rankdata(method2)
# x and y each contain integers 1-5; you are now
# testing the relative ranks of these users, not
# the absolute ranks of the users in the larger data set.
di = x - y
n = len(di)
rho = 1 - (6 * (di**2).sum()) / (n * (n**2 - 1))
rho # 0.7
Upvotes: 1