Reputation: 395
I am trying to calculate the t-test score of data stored in different dataframes using `ttest_rel` from `scipy.stats`. But when I calculate the t-test between the same data, it returns a NumPy ndarray of NaN instead of a single NaN. What am I doing wrong that I get an array instead of a single value?
My code with a sample dataframe is as follows:
import pandas as pd
import numpy as np
import re
from scipy.stats import ttest_rel
cities_nhl = pd.DataFrame({'metro': ['NewYork', 'LosAngeles', 'StLouis', 'Detroit', 'Boston', 'Baltimore'],
'total_ratio': [0.45, 0.51, 0.62, 0.43, 0.26, 0.32]})
cities_nba = pd.DataFrame({'metro': ['Boston', 'LosAngeles', 'Phoenix', 'Baltimore', 'Detroit', 'NewYork'],
'total_ratio': [0.50, 0.41, 0.34, 0.53, 0.33, 0.42]})
cities_mlb = pd.DataFrame({'metro': ['Seattle', 'Detroit', 'Boston', 'Baltimore', 'NewYork', 'LosAngeles'],
'total_ratio': [0.48, 0.27, 0.52, 0.33, 0.28, 0.67]})
cities_nfl = pd.DataFrame({'metro': ['LosAngeles', 'Atlanta', 'Detroit', 'Boston', 'NewYork', 'Baltimore'],
'total_ratio': [0.47, 0.41, 0.82, 0.13, 0.56, 0.42]})
needed_cols = ['metro', 'total_ratio'] #metro is a string and total_ratio is a float column
df_dict = {'NHL': cities_nhl[needed_cols], 'NBA': cities_nba[needed_cols],
'MLB': cities_mlb[needed_cols], 'NFL': cities_nfl[needed_cols]} #keeping all dataframes in a dictionary
#for ease of access
sports = ['NHL','NBA','MLB','NFL'] #name of sports
p_values_dict = {'NHL':[], 'NBA':[], 'MLB':[], 'NFL':[]} #dictionary to store p values
for clm1 in sports:
    for clm2 in sports:
        # merge the dataframes of two sports and then calculate their ttest score
        _df = pd.merge(df_dict[clm1], df_dict[clm2],
                       how='inner', on='metro', suffixes=[f'_{clm1}', f'_{clm2}'])
        _pval = ttest_rel(_df[f"total_ratio_{clm1}"], _df[f"total_ratio_{clm2}"])[1]
        p_values_dict[clm1].append(_pval)
p_values = pd.DataFrame(p_values_dict, index=sports)
p_values
| |NHL |NBA |MLB |NFL |
|-------|-----------|-----------|-----------|----------|
|NHL |[nan, nan] |0.589606 |0.826298 |0.38493 |
|NBA |0.589606 |[nan, nan] |0.779387 |0.782173 |
|MLB |0.826298 |0.779387 |[nan, nan] |0.713229 |
|NFL |0.38493 |0.782173 |0.713229 |[nan, nan]|
Upvotes: 1
Reputation: 3011
The problem here is actually not related to `scipy`, but is due to duplicate column labels in your dataframes. In this part of your code:
_df = pd.merge(df_dict[clm1], df_dict[clm2],
how='inner', on='metro', suffixes=[f'_{clm1}', f'_{clm2}'])
When `clm1` and `clm2` are equal (say they are both `NHL`), you get a `_df` dataframe like this:
metro total_ratio_NHL total_ratio_NHL
0 NewYork 0.45 0.45
1 LosAngeles 0.51 0.51
2 StLouis 0.62 0.62
3 Detroit 0.43 0.43
4 Boston 0.26 0.26
5 Baltimore 0.32 0.32
Then, when you pass the columns to the `ttest_rel` function, you end up passing both columns each time you refer to a single column label, because they have the same label:
ttest_rel(_df[f"total_ratio_{clm1}"], _df[f"total_ratio_{clm2}"])
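Each argument is therefore a two-column DataFrame rather than a Series, so the test is run column by column. That's how you get two t-statistics and two p-values.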
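To see the effect in isolation, here is a minimal sketch. It builds a frame with a duplicated label directly, so it does not depend on how `pd.merge` treats clashing suffixes in a given pandas version. The names `dup`, `selected` and `pvalues` are only illustrative, and the values are taken from the first three NHL rows.

```python
import warnings

import numpy as np
import pandas as pd
from scipy.stats import ttest_rel

# Illustrative frame where two columns share one label.
dup = pd.DataFrame(
    [[0.45, 0.45], [0.51, 0.51], [0.62, 0.62]],
    columns=["total_ratio_NHL", "total_ratio_NHL"],
)

# Selecting a duplicated label returns every matching column,
# so this is a two-column DataFrame, not a Series.
selected = dup["total_ratio_NHL"]
print(type(selected).__name__, selected.shape)

# The test runs along axis 0, once per column. The paired differences
# are all zero, so each column yields NaN (warnings silenced here).
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = ttest_rel(selected.to_numpy(), selected.to_numpy())

pvalues = np.asarray(result.pvalue)
print(pvalues.shape)
```

A quick way to check for this situation in your own loop would be `_df.columns.duplicated().any()`.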
So, you can modify those two lines to eliminate duplicate column labels, like this:
_df = pd.merge(df_dict[clm1], df_dict[clm2],
how='inner', on='metro', suffixes=[f'_{clm1}_1', f'_{clm2}_2'])
_pval = ttest_rel(_df[f"total_ratio_{clm1}_1"], _df[f"total_ratio_{clm2}_2"])[1]
The result will look like this:
NHL NBA MLB NFL
NHL NaN 0.589606 0.826298 0.384930
NBA 0.589606 NaN 0.779387 0.782173
MLB 0.826298 0.779387 NaN 0.713229
NFL 0.384930 0.782173 0.713229 NaN
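If you would rather not build the suffixes from the sport names at all, one option is to use fixed suffixes that can never collide. The following is only a sketch under that assumption: `paired_pvalue` is a hypothetical helper, not something from pandas or scipy, and the `_left`/`_right` suffixes are arbitrary.

```python
import warnings

import numpy as np
import pandas as pd
from scipy.stats import ttest_rel

nhl = pd.DataFrame({'metro': ['NewYork', 'LosAngeles', 'StLouis', 'Detroit', 'Boston', 'Baltimore'],
                    'total_ratio': [0.45, 0.51, 0.62, 0.43, 0.26, 0.32]})
nba = pd.DataFrame({'metro': ['Boston', 'LosAngeles', 'Phoenix', 'Baltimore', 'Detroit', 'NewYork'],
                    'total_ratio': [0.50, 0.41, 0.34, 0.53, 0.33, 0.42]})


def paired_pvalue(left, right):
    """Hypothetical helper: paired t-test p-value on the metros both frames share."""
    # Fixed suffixes stay distinct even when left and right are the same frame.
    merged = pd.merge(left, right, how='inner', on='metro',
                      suffixes=('_left', '_right'))
    with warnings.catch_warnings():
        # Identical inputs give zero differences, which can trigger warnings.
        warnings.simplefilter('ignore')
        result = ttest_rel(merged['total_ratio_left'].to_numpy(),
                           merged['total_ratio_right'].to_numpy())
    return float(result.pvalue)


p_cross = paired_pvalue(nhl, nba)   # two different sports
p_self = paired_pvalue(nhl, nhl)    # same sport: a single NaN, not an array
print(p_cross, p_self)
```

With the sample data, the NHL/NBA value should agree with the 0.589606 shown in the table, and the same-sport case comes back as one scalar NaN.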
Upvotes: 2