Reputation: 2333
I want to measure the correlation between two Conference related metrics (AcceptanceRate and FiveYrIF). I have the following two DataFrames (which are already ordered / ranked accordingly):
df_if:
Conference FiveYrIF
0 SIGMOD Conference 112.685585
1 KDD 103.674543
2 CHI 99.453096
3 SIGIR 68.967753
4 WWW 65.715631
5 SODA 60.151959
6 DAC 42.076365
7 ICCAD 39.906361
8 CIKM 33.232224
9 DATE 26.578906
10 INFOCOM 22.694122
11 Winter Simulation Conference 17.448830
12 SAC 10.646007
df_ar:
Conference AcceptanceRate
0 CIKM 15
1 SIGIR 16
2 INFOCOM 19.7
3 KDD 21
4 DAC 22
5 DATE 23
6 WWW 24
7 CHI 25
8 ICCAD 27
9 SIGMOD Conference 27
10 SAC 29
11 SODA 29.5
12 Winter Simulation Conference 54
I want to compare the two metrics (FiveYrIF and AcceptanceRate) using the stats.kendalltau method. I have used it before, but with a ranking of years (numbers) rather than a ranking of conferences (text) as shown here.
I tried the following:
from scipy.stats import kendalltau
kendalltau(df_if['Conference'].values, df_ar['Conference'].values)
But it returned the following error:
TypeError: merge sort not available for item 0
I'm not quite sure what I'm doing wrong. My understanding is that the things being compared only have to be ordinal (ordered), not necessarily numeric. We compare orders, don't we?
If possible, I'd like to avoid going back to the database and setting up some sort of numerical ID for each Conference just to be able to do this.
Upvotes: 1
Views: 1245
Reputation: 114921
Apparently kendalltau does not handle the object array used by Pandas. You can work around this by converting it to an array of strings before passing it to kendalltau.
For example, here's a DataFrame:
In [107]: df
Out[107]:
x y
0 aaa 0.5
1 bb 1.4
2 c 1.3
3 d 2.0
4 ee 2.1
The values in the x column are strings. Pandas represents arrays of strings as arrays with data type object:
In [108]: df['x']
Out[108]:
0 aaa
1 bb
2 c
3 d
4 ee
Name: x, dtype: object
In [109]: df['x'].values
Out[109]: array(['aaa', 'bb', 'c', 'd', 'ee'], dtype=object)
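To make the difference between the two representations concrete, here is a small NumPy-only sketch. It builds the object array by hand (mirroring the `.values` output above, rather than going through Pandas) and converts it with `astype(str)`; the variable names are just for illustration.

```python
import numpy as np

# An object array holding Python str objects, like the one shown above.
obj_values = np.array(['aaa', 'bb', 'c', 'd', 'ee'], dtype=object)

# Converting with astype(str) yields a fixed-width NumPy unicode array.
str_values = obj_values.astype(str)

print(obj_values.dtype.kind)  # 'O' means generic Python objects
print(str_values.dtype.kind)  # 'U' means NumPy unicode strings
```

The contents are the same in both arrays; only the dtype changes.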
kendalltau doesn't handle such an array:
In [110]: kendalltau(df['x'], df['y'])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-110-07ca97e866e2> in <module>()
----> 1 kendalltau(df['x'], df['y'])
/Users/warren/anaconda/lib/python2.7/site-packages/scipy/stats/stats.pyc in kendalltau(x, y, initial_lexsort)
3020 if initial_lexsort:
3021 # sort implemented as mergesort, worst case: O(n log(n))
-> 3022 perm = np.lexsort((y, x))
3023 else:
3024 # sort implemented as quicksort, 30% faster but with worst case: O(n^2)
TypeError: merge sort not available for item 1
In [111]: kendalltau(df['x'].values, df['y'])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-111-e903a3b3475e> in <module>()
----> 1 kendalltau(df['x'].values, df['y'])
/Users/warren/anaconda/lib/python2.7/site-packages/scipy/stats/stats.pyc in kendalltau(x, y, initial_lexsort)
3020 if initial_lexsort:
3021 # sort implemented as mergesort, worst case: O(n log(n))
-> 3022 perm = np.lexsort((y, x))
3023 else:
3024 # sort implemented as quicksort, 30% faster but with worst case: O(n^2)
TypeError: merge sort not available for item 1
It works if you convert the array to an array of strings, using df['x'].values.astype(str):
In [112]: kendalltau(df['x'].values.astype(str), df['y'])
Out[112]: KendalltauResult(correlation=0.79999999999999982, pvalue=0.050043527347496564)
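One caveat for the data in the question: as far as I can tell, when kendalltau is given strings it orders them alphabetically, so passing the two Conference columns would compare alphabetical positions rather than the two rankings. A possible alternative, sketched below under the assumption that each conference appears exactly once in each DataFrame, is to align the two tables on Conference with merge and pass the two numeric columns to kendalltau. The data is copied from the question; the names merged, tau and p_value are only illustrative.

```python
import pandas as pd
from scipy.stats import kendalltau

# Data copied from the question.
df_if = pd.DataFrame({
    "Conference": [
        "SIGMOD Conference", "KDD", "CHI", "SIGIR", "WWW", "SODA", "DAC",
        "ICCAD", "CIKM", "DATE", "INFOCOM", "Winter Simulation Conference",
        "SAC",
    ],
    "FiveYrIF": [
        112.685585, 103.674543, 99.453096, 68.967753, 65.715631, 60.151959,
        42.076365, 39.906361, 33.232224, 26.578906, 22.694122, 17.448830,
        10.646007,
    ],
})
df_ar = pd.DataFrame({
    "Conference": [
        "CIKM", "SIGIR", "INFOCOM", "KDD", "DAC", "DATE", "WWW", "CHI",
        "ICCAD", "SIGMOD Conference", "SAC", "SODA",
        "Winter Simulation Conference",
    ],
    "AcceptanceRate": [
        15, 16, 19.7, 21, 22, 23, 24, 25, 27, 27, 29, 29.5, 54,
    ],
})

# Align the two metrics row by row, so each row describes one conference.
merged = df_if.merge(df_ar, on="Conference")

# kendalltau works on paired numeric observations; no numeric ID is needed.
tau, p_value = kendalltau(merged["FiveYrIF"], merged["AcceptanceRate"])
print(tau, p_value)
```

Keep in mind that a high impact factor and a low acceptance rate both indicate a more selective venue, so the sign of tau has to be read with that in mind.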
Upvotes: 1