Why does SciPy ttest_rel return numpy ndarray of NaN?

Question

I am trying to compute paired t-test p-values for data stored in different dataframes, using ttest_rel from scipy.stats. When I run the test on a dataframe against itself, it returns a NumPy ndarray of NaN instead of a single NaN. What am I doing wrong, and why do I get an array instead of a single value?

My code with a sample dataframe is as follows:

import pandas as pd
import numpy as np
import re
from scipy.stats import ttest_rel

cities_nhl = pd.DataFrame({'metro': ['NewYork', 'LosAngeles', 'StLouis', 'Detroit', 'Boston', 'Baltimore'],
                           'total_ratio': [0.45, 0.51, 0.62, 0.43, 0.26, 0.32]})
cities_nba = pd.DataFrame({'metro': ['Boston', 'LosAngeles', 'Phoenix', 'Baltimore', 'Detroit', 'NewYork'],
                           'total_ratio': [0.50, 0.41, 0.34, 0.53, 0.33, 0.42]})
cities_mlb = pd.DataFrame({'metro': ['Seattle', 'Detroit', 'Boston', 'Baltimore', 'NewYork', 'LosAngeles'],
                           'total_ratio': [0.48, 0.27, 0.52, 0.33, 0.28, 0.67]})
cities_nfl = pd.DataFrame({'metro': ['LosAngeles', 'Atlanta', 'Detroit', 'Boston', 'NewYork', 'Baltimore'],
                           'total_ratio': [0.47, 0.41, 0.82, 0.13, 0.56, 0.42]})

needed_cols = ['metro', 'total_ratio'] #metro is a string and total_ratio is a float column
df_dict = {'NHL': cities_nhl[needed_cols], 'NBA': cities_nba[needed_cols],
           'MLB': cities_mlb[needed_cols], 'NFL': cities_nfl[needed_cols]} #keeping all dataframes in a dictionary  
                                                                           #for ease of access
sports = ['NHL','NBA','MLB','NFL'] #name of sports
p_values_dict = {'NHL':[], 'NBA':[], 'MLB':[], 'NFL':[]} #dictionary to store p values
for clm1 in sports:
    for clm2 in sports:
        #merge the dataframes of two sports and then calculate their ttest score
        _df = pd.merge(df_dict[clm1], df_dict[clm2], 
                          how='inner', on='metro', suffixes=[f'_{clm1}', f'_{clm2}'])
        _pval = ttest_rel(_df[f"total_ratio_{clm1}"], _df[f"total_ratio_{clm2}"])[1]
        p_values_dict[clm1].append(_pval)
p_values = pd.DataFrame(p_values_dict, index=sports)
p_values

|       |NHL        |NBA        |MLB        |NFL       |
|-------|-----------|-----------|-----------|----------|
|NHL    |[nan, nan] |0.589606   |0.826298   |0.38493   |
|NBA    |0.589606   |[nan, nan] |0.779387   |0.782173  |
|MLB    |0.826298   |0.779387   |[nan, nan] |0.713229  |
|NFL    |0.38493    |0.782173   |0.713229   |[nan, nan]|

AlexK · Accepted Answer

The problem is not actually related to SciPy. It comes from duplicate column labels in the merged dataframe. In this part of your code:

        _df = pd.merge(df_dict[clm1], df_dict[clm2], 
                          how='inner', on='metro', suffixes=[f'_{clm1}', f'_{clm2}'])

When clm1 and clm2 are equal (say they are both NHL), you get a _df dataframe like this:

        metro  total_ratio_NHL  total_ratio_NHL
0     NewYork             0.45             0.45
1  LosAngeles             0.51             0.51
2     StLouis             0.62             0.62
3     Detroit             0.43             0.43
4      Boston             0.26             0.26
5   Baltimore             0.32             0.32

Because both columns have the same label, selecting that label returns both of them as a two-column DataFrame rather than a single Series. So each argument you pass to ttest_rel here is a two-column DataFrame:

ttest_rel(_df[f"total_ratio_{clm1}"], _df[f"total_ratio_{clm2}"])

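ttest_rel runs along axis 0 by default, so it performs one test per column, and that's how you get two t-statistics and two p-values. Each column is compared with an identical copy of itself, so every paired difference is zero and the result is NaN.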
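To see the duplicate-label behaviour in isolation, here is a minimal sketch that uses pandas only. The dataframe is built by hand with a repeated column label (three rows taken from the table above) rather than produced by pd.merge, and the names df and selected are just for illustration:

```python
import pandas as pd

# Hand-built frame with a repeated column label, mimicking the merged
# result when clm1 == clm2 (values taken from the NHL table above).
df = pd.DataFrame(
    [["NewYork", 0.45, 0.45],
     ["LosAngeles", 0.51, 0.51],
     ["StLouis", 0.62, 0.62]],
    columns=["metro", "total_ratio_NHL", "total_ratio_NHL"],
)

# Selecting a duplicated label returns every matching column,
# so the result is a DataFrame, not a Series.
selected = df["total_ratio_NHL"]

print(df.columns.is_unique)     # False
print(type(selected).__name__)  # DataFrame
print(selected.shape)           # (3, 2)
```

Checking df.columns.is_unique after a merge is a cheap way to catch this kind of problem before the data reaches ttest_rel.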

So, you can modify those two lines to eliminate duplicate column labels, like this:

        _df = pd.merge(df_dict[clm1], df_dict[clm2], 
                          how='inner', on='metro', suffixes=[f'_{clm1}_1', f'_{clm2}_2'])
        _pval = ttest_rel(_df[f"total_ratio_{clm1}_1"], _df[f"total_ratio_{clm2}_2"])[1]

The result will look like this:

         NHL         NBA         MLB         NFL
NHL      NaN    0.589606    0.826298    0.384930
NBA 0.589606         NaN    0.779387    0.782173
MLB 0.826298    0.779387         NaN    0.713229
NFL 0.384930    0.782173    0.713229         NaN
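Another option, sketched here as an assumption rather than as part of the fix above, is to skip the self-comparisons entirely: the diagonal is NaN anyway and the two-sided p-value does not depend on the order of the two samples, so each unordered pair only needs to be tested once. The helper name pairwise_p_values and the suffixes _left and _right are hypothetical choices for this example:

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import ttest_rel

# Sample data from the question.
df_dict = {
    'NHL': pd.DataFrame({'metro': ['NewYork', 'LosAngeles', 'StLouis', 'Detroit', 'Boston', 'Baltimore'],
                         'total_ratio': [0.45, 0.51, 0.62, 0.43, 0.26, 0.32]}),
    'NBA': pd.DataFrame({'metro': ['Boston', 'LosAngeles', 'Phoenix', 'Baltimore', 'Detroit', 'NewYork'],
                         'total_ratio': [0.50, 0.41, 0.34, 0.53, 0.33, 0.42]}),
    'MLB': pd.DataFrame({'metro': ['Seattle', 'Detroit', 'Boston', 'Baltimore', 'NewYork', 'LosAngeles'],
                         'total_ratio': [0.48, 0.27, 0.52, 0.33, 0.28, 0.67]}),
    'NFL': pd.DataFrame({'metro': ['LosAngeles', 'Atlanta', 'Detroit', 'Boston', 'NewYork', 'Baltimore'],
                         'total_ratio': [0.47, 0.41, 0.82, 0.13, 0.56, 0.42]}),
}


def pairwise_p_values(frames, key='metro', value='total_ratio'):
    """Return a symmetric table of paired t-test p-values, NaN on the diagonal."""
    names = list(frames)
    result = pd.DataFrame(np.nan, index=names, columns=names)
    # combinations() yields each unordered pair once and never pairs a
    # name with itself, so the merged column labels are always distinct.
    for a, b in combinations(names, 2):
        merged = pd.merge(frames[a][[key, value]], frames[b][[key, value]],
                          how='inner', on=key, suffixes=('_left', '_right'))
        p = ttest_rel(merged[f'{value}_left'], merged[f'{value}_right']).pvalue
        result.loc[a, b] = p
        result.loc[b, a] = p
    return result


p_values = pairwise_p_values(df_dict)
print(p_values)
```

Because the suffixes are fixed and distinct, the duplicate-label case cannot arise here, and the loop runs 6 tests instead of 16.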
