Reputation: 45
I am trying to compute the Spearman rank correlation coefficient between two ranked lists, once with scipy.stats.spearmanr and once manually using the formula

ρ = 1 − (6 Σ dᵢ²) / (n(n² − 1))

However, I am getting significantly different results.
I have two lists representing the top 10 similar users identified using two different methods:
Method 1 Rankings:
User ID: [301, 597, 414, 477, 57, 369, 206, 535, 590, 418]
Rank: [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Method 2 Rankings:
User ID: [301, 477, 19, 120, 75, 57, 597, 160, 577, 369]
Rank: [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
I then applied Spearman’s rank correlation using scipy.stats.spearmanr:
from scipy.stats import spearmanr
method1 = [301, 597, 414, 477, 57, 369, 206, 535, 590, 418]
method2 = [301, 477, 19, 120, 75, 57, 597, 160, 577, 369]
coef, _ = spearmanr(method1, method2)
print(f"Spearman coefficient: {coef}")
Spearman coefficient: 0.2727
Manual Calculation: To verify, I computed ρ manually using the common users between both lists.
User ID | Rank (Method 1) | Rank (Method 2) | (d_i = R_1 - R_2) | (d_i^2) |
---|---|---|---|---|
301 | 1 | 1 | 0 | 0 |
597 | 2 | 7 | -5 | 25 |
477 | 4 | 2 | 2 | 4 |
57 | 5 | 6 | -1 | 1 |
369 | 6 | 10 | -4 | 16 |
We have only 5 common users, so n = 5.
Using the formula, I got ρ = 1 − (6 · 46) / (5 · (5² − 1)) = −1.3, which is outside the valid range [−1, 1] for Spearman's coefficient. However, scipy.stats.spearmanr() gives 0.2727, which is within range but significantly different.
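For reference, here is the arithmetic of my manual calculation, using the d² column from the table above:

```python
# Squared rank differences of the 5 common users (from the table); sum = 46
d_squared = [0, 25, 4, 1, 16]
n = 5
rho = 1 - 6 * sum(d_squared) / (n * (n**2 - 1))
print(rho)   # ≈ -1.3, outside [-1, 1]
```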
Why is my manual calculation incorrect? Am I making a mistake in how I handle missing users or the ranking?
How does scipy.stats.spearmanr handle missing values or mismatched lists internally? Does it rank only common elements, or does it consider all elements?
Where is my manual approach incorrect, and how does scipy.stats.spearmanr internally process the rankings?
Upvotes: 0
Views: 68
Reputation: 94
The formula used in the manual calculation is only valid if there are no ties in the ranks. See, for example, the "Definition and calculation" section of https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient
In any case, your ranks puzzle me. This is what scipy does implicitly, and it agrees with the 0.2727 answer:
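To illustrate the ties issue with a small example of my own (not the question's data): when a list contains a tie, the shortcut formula and spearmanr, which computes the Pearson correlation of the (average-for-ties) ranks, disagree slightly:

```python
from scipy.stats import spearmanr, rankdata

x = [1, 2, 2, 4]   # the tied 2s get average rank 2.5
y = [1, 3, 2, 4]
n = len(x)

# Shortcut formula applied to the (tied) ranks
rx, ry = rankdata(x), rankdata(y)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
print(1 - 6 * d2 / (n * (n**2 - 1)))   # 0.95

# scipy computes the Pearson correlation of the ranks instead
coef, _ = spearmanr(x, y)
print(coef)   # ≈ 0.9487
```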
from scipy.stats import spearmanr, rankdata

method1 = [301, 597, 414, 477, 57, 369, 206, 535, 590, 418]
method2 = [301, 477, 19, 120, 75, 57, 597, 160, 577, 369]

coef, _ = spearmanr(method1, method2)
print(f"Spearman coefficient: {coef}")

# spearmanr ranks the raw values (the user IDs) within each full list of 10
rank1 = rankdata(method1)
rank2 = rankdata(method2)
print(rank1)
print(rank2)

# Apply the formula with n = 10, so n*(n^2 - 1) = 10*99 = 990
d2 = (rank1 - rank2)**2
rho = 1 - 6*sum(d2)/(10*99)
print(rho)
Upvotes: 0
Reputation: 3873
Why is my manual calculation incorrect?
That formula expects all rank numbers to be present (e.g. each list with five elements would include integers 1-5), but your lists are skipping rank numbers.
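To make that concrete, a quick sketch using the five common users' ranks from the question: plugging the gappy ranks straight into the formula gives the out-of-range −1.3, while first re-ranking them to the consecutive integers the formula assumes gives a valid coefficient:

```python
from scipy.stats import rankdata

r1 = [1, 2, 4, 5, 6]    # Method 1 ranks of the 5 common users (3 is skipped)
r2 = [1, 7, 2, 6, 10]   # Method 2 ranks (3, 4, 5, 8, 9 are skipped)
n = len(r1)

# Formula applied to the raw, gappy ranks -> out of range
d2_raw = sum((a - b) ** 2 for a, b in zip(r1, r2))
rho_raw = 1 - 6 * d2_raw / (n * (n**2 - 1))
print(rho_raw)     # ≈ -1.3

# Re-rank to consecutive integers 1..5 first -> valid coefficient
x, y = rankdata(r1), rankdata(r2)
d2_fixed = sum((a - b) ** 2 for a, b in zip(x, y))
rho_fixed = 1 - 6 * d2_fixed / (n * (n**2 - 1))
print(rho_fixed)   # 0.7
```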
How does scipy.stats.spearmanr handle missing values or mismatched lists internally? Does it rank only common elements, or does it consider all elements?
It looks like you're expecting spearmanr to interpret the data you're providing as two rankings of users, matched up by user ID. It doesn't do that; it treats the two lists as paired numeric observations, ranks the values within each list, and correlates those ranks.
Your code is actually testing a null hypothesis of monotonicity between data represented in the following plot:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
method1 = [301, 597, 414, 477, 57, 369, 206, 535, 590, 418]
method2 = [301, 477, 19, 120, 75, 57, 597, 160, 577, 369]
plt.plot(method1, method2, '*')
plt.xlabel("User ID in list 1")
plt.ylabel("Corresponding user ID in list 2")
plt.title("Correlation between paired user IDs?")
stats.spearmanr(method1, method2)
# SignificanceResult(statistic=0.2727272727272727, pvalue=0.44583834154275137)
The numbers you're giving it are user IDs, so they could just as well be names or colors. You can order user IDs because they happen to be represented with numbers, and the correlation calculation happens to produce a result, but the number it produces is meaningless.
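One way to see this (my own illustration): swap two ID labels consistently in both lists. The underlying rankings are unchanged, yet the coefficient changes, because only the numeric order of the IDs matters to spearmanr:

```python
from scipy.stats import spearmanr

method1 = [301, 597, 414, 477, 57, 369, 206, 535, 590, 418]
method2 = [301, 477, 19, 120, 75, 57, 597, 160, 577, 369]

# Relabel: users 57 and 597 trade ID numbers in both lists.
# The rankings are identical; only the arbitrary labels changed.
swap = {57: 597, 597: 57}
m1 = [swap.get(u, u) for u in method1]
m2 = [swap.get(u, u) for u in method2]

coef_old, _ = spearmanr(method1, method2)
coef_new, _ = spearmanr(m1, m2)
print(coef_old)   # ≈ 0.2727
print(coef_new)   # ≈ -0.0788 -- a different "correlation" for the same rankings
```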
The simplest thing you can do to perform a reasonable test is to calculate the Spearman correlation coefficient between the ranks:
method1 = [1, 2, 4, 5, 6]
method2 = [1, 7, 2, 6, 10]
plt.plot(method1, method2, '*')
plt.xlabel("Rank according to method 1")
plt.ylabel("Rank according to method 2")
plt.title("Correlation between ranks according to two methods")
stats.spearmanr(method1, method2)
# SignificanceResult(statistic=0.7, pvalue=0.1881204043741873)
This calculates the correlation coefficient something like:
x = stats.rankdata(method1)
y = stats.rankdata(method2)
# x and y each contain integers 1-5; you are now
# testing the relative ranks of these users, not
# the absolute ranks of the users in the larger data set.
di = x - y
n = len(di)
rho = 1 - (6 * (di**2).sum()) / (n * (n**2 - 1))
rho # 0.7
Upvotes: 1