BKS


Using Python's stats.kendalltau function

I want to measure the correlation between two conference-related metrics (AcceptanceRate and FiveYrIF). I have the following two DataFrames, each already ordered / ranked by its own metric:

df_if:

                      Conference    FiveYrIF
0              SIGMOD Conference  112.685585
1                            KDD  103.674543
2                            CHI   99.453096
3                          SIGIR   68.967753
4                            WWW   65.715631
5                           SODA   60.151959
6                            DAC   42.076365
7                          ICCAD   39.906361
8                           CIKM   33.232224
9                           DATE   26.578906
10                       INFOCOM   22.694122
11  Winter Simulation Conference   17.448830
12                           SAC   10.646007 

df_ar:

                      Conference AcceptanceRate
0                           CIKM             15
1                          SIGIR             16
2                        INFOCOM           19.7
3                            KDD             21
4                            DAC             22
5                           DATE             23
6                            WWW             24
7                            CHI             25
8                          ICCAD             27
9              SIGMOD Conference             27
10                           SAC             29
11                          SODA           29.5
12  Winter Simulation Conference             54 

I want to compare the two metrics (FiveYrIF and AcceptanceRate) using the stats.kendalltau method. I have used it before, but on rankings of years (numbers) rather than rankings of conferences (text), as shown here.

I tried the following:

from scipy.stats import kendalltau

kendalltau(df_if['Conference'].values, df_ar['Conference'].values)

But it returned the following error:

TypeError: merge sort not available for item 0

I'm not quite sure what I'm doing wrong. It is my understanding that the values being compared only have to be ordinal (ordered), not necessarily numeric. We compare orders, don't we?

If possible, I'd like to avoid going back to the database and setting up some sort of numerical ID for each conference just to perform this comparison.


Answers (1)

Warren Weckesser


Apparently kendalltau does not handle the object arrays that Pandas uses for string columns. You can work around this by converting the column to a NumPy array of strings before passing it to kendalltau.

For example, here's a DataFrame:

In [107]: df
Out[107]: 
     x    y
0  aaa  0.5
1   bb  1.4
2    c  1.3
3    d  2.0
4   ee  2.1

The values in the x column are strings. Pandas represents arrays of strings as arrays with data type object:

In [108]: df['x']
Out[108]: 
0    aaa
1     bb
2      c
3      d
4     ee
Name: x, dtype: object

In [109]: df['x'].values
Out[109]: array(['aaa', 'bb', 'c', 'd', 'ee'], dtype=object)
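To make the dtype difference concrete, here is a small sketch that builds the same example DataFrame and compares the object array with its string-converted counterpart. The variable names obj_values and str_values are only illustrative, and the object dtype is requested explicitly because newer Pandas versions may store strings with a dedicated string dtype instead.

```python
import pandas as pd

# Same toy data as the example DataFrame above.
df = pd.DataFrame({
    "x": ["aaa", "bb", "c", "d", "ee"],
    "y": [0.5, 1.4, 1.3, 2.0, 2.1],
})

# Ask for an object array explicitly, so the result does not depend on
# how the installed Pandas version stores string columns.
obj_values = df["x"].to_numpy(dtype=object)

# astype(str) produces a fixed-width NumPy unicode array instead.
str_values = obj_values.astype(str)

print(obj_values.dtype)       # generic Python-object dtype
print(str_values.dtype.kind)  # 'U' marks a NumPy unicode string dtype
```

The element values are unchanged by the conversion; only the array's data type differs, and that is what the sorting routine inside kendalltau appears to care about.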

kendalltau doesn't handle such an array:

In [110]: kendalltau(df['x'], df['y'])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-110-07ca97e866e2> in <module>()
----> 1 kendalltau(df['x'], df['y'])

/Users/warren/anaconda/lib/python2.7/site-packages/scipy/stats/stats.pyc in kendalltau(x, y, initial_lexsort)
   3020     if initial_lexsort:
   3021         # sort implemented as mergesort, worst case: O(n log(n))
-> 3022         perm = np.lexsort((y, x))
   3023     else:
   3024         # sort implemented as quicksort, 30% faster but with worst case: O(n^2)

TypeError: merge sort not available for item 1

In [111]: kendalltau(df['x'].values, df['y'])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-111-e903a3b3475e> in <module>()
----> 1 kendalltau(df['x'].values, df['y'])

/Users/warren/anaconda/lib/python2.7/site-packages/scipy/stats/stats.pyc in kendalltau(x, y, initial_lexsort)
   3020     if initial_lexsort:
   3021         # sort implemented as mergesort, worst case: O(n log(n))
-> 3022         perm = np.lexsort((y, x))
   3023     else:
   3024         # sort implemented as quicksort, 30% faster but with worst case: O(n^2)

TypeError: merge sort not available for item 1

It works if you convert the array to an array of strings, using df['x'].values.astype(str):

In [112]: kendalltau(df['x'].values.astype(str), df['y'])
Out[112]: KendalltauResult(correlation=0.79999999999999982, pvalue=0.050043527347496564)
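Note that with string input, the ordering being compared is the lexicographic order of the strings themselves. For the two DataFrames in the question, a possibly more direct route is to align both metrics on the Conference column and pass the two numeric columns to kendalltau, which avoids the need for numerical IDs. The following is only a sketch under some assumptions: the data is retyped from the question, AcceptanceRate is assumed to be stored as strings (hence pd.to_numeric), and the name merged is just illustrative.

```python
import pandas as pd
from scipy.stats import kendalltau

# Data retyped from the question's df_if.
df_if = pd.DataFrame({
    "Conference": [
        "SIGMOD Conference", "KDD", "CHI", "SIGIR", "WWW", "SODA", "DAC",
        "ICCAD", "CIKM", "DATE", "INFOCOM",
        "Winter Simulation Conference", "SAC",
    ],
    "FiveYrIF": [
        112.685585, 103.674543, 99.453096, 68.967753, 65.715631,
        60.151959, 42.076365, 39.906361, 33.232224, 26.578906,
        22.694122, 17.448830, 10.646007,
    ],
})

# Data retyped from the question's df_ar.
# Assumption: the column holds strings, as the mixed "15" / "19.7" display suggests.
df_ar = pd.DataFrame({
    "Conference": [
        "CIKM", "SIGIR", "INFOCOM", "KDD", "DAC", "DATE", "WWW", "CHI",
        "ICCAD", "SIGMOD Conference", "SAC", "SODA",
        "Winter Simulation Conference",
    ],
    "AcceptanceRate": [
        "15", "16", "19.7", "21", "22", "23", "24", "25",
        "27", "27", "29", "29.5", "54",
    ],
})

# Convert the acceptance rates to floats so they sort numerically.
df_ar["AcceptanceRate"] = pd.to_numeric(df_ar["AcceptanceRate"])

# Put both metrics for each conference on the same row.
merged = df_if.merge(df_ar, on="Conference")

# Kendall's tau ranks the values internally, so no explicit IDs are needed.
tau, p_value = kendalltau(merged["FiveYrIF"], merged["AcceptanceRate"])
print(tau, p_value)
```

Because the merge pairs each conference's two values on one row, the row order of the original DataFrames no longer matters.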
