Reputation: 99
I have created two classifiers, a Boosted Decision Tree (BDT) and a Neural Network (NN), to classify events as belonging to either a signal class or a background class. Each outputs a continuous probability between 0 and 1 that an event belongs to the signal class. I want to compare the two methods and find the correlation between them.
However, I find that if I calculate the correlation coefficient over just the background events, or over just the signal events, these correlations are smaller than the correlation over the entire dataset. I would have assumed that, since both classifiers are tested on exactly the same dataset, the total correlation would be a weighted average of the two individual correlations. Note that the total dataset consists of ~100,000 events.
Here I calculate the correlation for the whole dataset using the pandas .corr() function, which calculates the Pearson correlation matrix:
dfBDT = pd.read_csv("BDTResults.csv")
dfNN = pd.read_csv("NNResults.csv")
# not sorted by EventNumber by default
dfBDT = dfBDT.sort_values('EventNumber')
dfNN = dfNN.sort_values('EventNumber')
# Resets index of sorted dataframe so sorted dataframe index begins at 0
dfBDT.reset_index(drop=True, inplace=True)
dfNN.reset_index(drop=True, inplace=True)
dfscore = pd.concat([dfBDT['score'],dfNN['score']], axis = 1)
dfnum = pd.concat([dfBDT['EventNumber'],dfNN['EventNumber']], axis = 1)
dfTotal = pd.concat([dfnum,dfscore], axis = 1)
dfTotal.columns = ['EventNumberBDT', 'EventNumberNN', 'BDT', 'NN']
dfTotal.corr()
This gives a 97% correlation. I then do the same just for the background events where I have defined the background events to have a class of 0:
BDT_back = (dfBDT.loc[dfBDT['Class'] == 0])['score']
BDT_back.reset_index(drop=True, inplace=True)
BDT_back_num = (dfBDT.loc[dfBDT['Class'] == 0])['EventNumber']
BDT_back_num.reset_index(drop=True, inplace=True)
NN_back = (dfNN.loc[dfNN['Class'] == 0])['score']
NN_back.reset_index(drop=True, inplace=True)
NN_back_num = (dfNN.loc[dfNN['Class'] == 0])['EventNumber']
NN_back_num.reset_index(drop=True, inplace=True)
dfBack = pd.concat([BDT_back_num,NN_back_num,BDT_back,NN_back],
axis = 1)
dfBack.reset_index(drop=True, inplace=True)
dfBack.columns = ['EventNumberBDT','EventNumberNN','BDT','NN']
dfBack.corr()
This gives me a correlation of about 96%. I then repeat the above for the signal events, i.e. replace Class == 0 with Class == 1, and I get a correlation of 91%.
Then if I rejoin the two dataframes and calculate the total correlation again, I get a higher correlation than before, 98%:
ab = pd.concat([dfBack['BDT'],dfSig['BDT']])
ba = pd.concat([dfBack['NN'],dfSig['NN']])
abba = pd.concat([ab,ba], axis = 1)
abba.corr()
The fact that these two values are different must mean that something is going wrong but I do not know where.
Upvotes: 1
Views: 250
Reputation: 107587
Ultimately, it comes down to the horizontal merges, which align on indexes.
Unmatched Rows
If the two data frames have different row counts, pd.concat with axis=1, which defaults to an outer join, will generate NaN at the unmatched indexes (on the data frame with fewer rows), leaving the result with more rows than the original data frame had before splitting.
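A minimal sketch of this NaN padding, using toy values (not OP's data):

```python
import pandas as pd

# Two Series of different lengths, as can happen after a class filter
# leaves different row counts on each side.
s1 = pd.Series([0.2, 0.8, 0.5])   # 3 rows
s2 = pd.Series([0.3, 0.7])        # 2 rows

# axis=1 concat aligns on the index with an outer join, so the shorter
# side is padded with NaN instead of raising an error.
df = pd.concat([s1, s2], axis=1)
# df has 3 rows; the last row of the second column is NaN.
```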
Unmatched Classes
Additionally, if Class has different percentage shares between the two data frames, dfBDT and dfNN, the corresponding joins will return NaN at the unmatched indexes. For example, suppose dfBDT splits 60% / 40% between Class 0 and 1 while dfNN splits 50% / 50%. After a horizontal join with pd.concat(..., axis = 1), which defaults to an outer join (how = 'outer'), the resulting mismatches will generate NaN on both sides. Even if you use how = 'inner', you will be filtering out the mismatches, whereas dfTotal never filters out any rows but includes them all.
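Those NaN-padded rows matter because .corr() silently excludes incomplete pairs rather than raising an error, so the split correlations are computed over a different set of row pairings than the total. A toy illustration (hypothetical values, not OP's data):

```python
import numpy as np
import pandas as pd

# DataFrame.corr() uses pairwise-complete observations: any row where
# either column is NaN simply never enters the correlation.
df = pd.DataFrame({'BDT': [0.1, 0.4, 0.9, 0.7],
                   'NN':  [0.2, 0.5, np.nan, np.nan]})

padded = df.corr().loc['BDT', 'NN']             # NaN rows dropped pairwise
complete = df.dropna().corr().loc['BDT', 'NN']  # explicit drop, same result
```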
Sort Order
Testing between Linux and Windows machines with a seeded, reproducible example indicates that sort order matters, specifically sorting by Class first and then EventNumber.
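To see why sort order can change the result at all: with index- or position-based alignment, the row order decides which values get paired. A toy sketch (hypothetical scores, not OP's data):

```python
import pandas as pd

# With positional alignment, re-sorting one side repairs the rows
# differently and can change the correlation completely.
a = pd.Series([1.0, 2.0, 3.0, 4.0], name='BDT')
b = pd.Series([4.0, 3.0, 2.0, 1.0], name='NN')

as_is = pd.concat([a, b], axis=1).corr().iloc[0, 1]        # pairs reversed
resorted = pd.concat(
    [a, b.sort_values().reset_index(drop=True)], axis=1
).corr().iloc[0, 1]                                         # pairs matched
```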
This can be demonstrated with seeded, random data for a reproducible example. The code below refactors your code to avoid the many pd.concat calls by using join instead. Further down, the same test is run against OP's original setup.
Data
import numpy as np
import pandas as pd
np.random.seed(2292020)
dfBDT = pd.DataFrame({'EventNumber': np.random.randint(1, 15, 500),
                      'Class': np.random.randint(0, 2, 500),
                      'score': np.random.randn(500)
                     })
dfNN = pd.DataFrame({'EventNumber': np.random.randint(1, 15, 500),
                     'Class': np.random.randint(0, 2, 500),
                     'score': np.random.randn(500)
                    })
Code
dfBDT = dfBDT.sort_values(['Class', 'EventNumber']).reset_index(drop=True)
dfNN = dfNN.sort_values(['Class', 'EventNumber']).reset_index(drop=True)
# ALL ROWS (NO FILTER)
dfTotal = (dfBDT.reindex(['EventNumber', 'score'], axis='columns')
.join(dfNN.reindex(['EventNumber', 'score'], axis='columns'),
rsuffix = '_')
.set_axis(['EventNumberBDT', 'BDT', 'EventNumberNN', 'NN'],
axis='columns', inplace = False)
.reindex(['EventNumberBDT','EventNumberNN','BDT','NN'],
axis='columns'))
dfTotal.corr()
# TWO FILTERED DATA FRAMES CLASS (0 FOR BACKGROUND, 1 FOR SIGNAL)
df_list = [(dfBDT.query('Class == {}'.format(i))
.reindex(['EventNumber', 'score'], axis='columns')
.join(dfNN.query('Class == {}'.format(i))
.reindex(['EventNumber', 'score'], axis='columns'),
rsuffix = '_')
.set_axis(['EventNumberBDT', 'BDT', 'EventNumberNN', 'NN'],
axis='columns', inplace = False)
.reindex(['EventNumberBDT','EventNumberNN','BDT','NN'],
axis='columns')
) for i in range(0,2)]
dfSub = pd.concat(df_list)
dfSub.corr()
Output (notice they return different results)
dfTotal.corr()
# EventNumberBDT EventNumberNN BDT NN
# EventNumberBDT 1.000000 0.912279 -0.024121 0.115754
# EventNumberNN 0.912279 1.000000 -0.039038 0.122905
# BDT -0.024121 -0.039038 1.000000 0.012143
# NN 0.115754 0.122905 0.012143 1.000000
dfSub.corr()
# EventNumberBDT EventNumberNN BDT NN
# EventNumberBDT 1.000000 0.974140 -0.024121 0.120102
# EventNumberNN 0.974140 1.000000 -0.026026 0.122905
# BDT -0.024121 -0.026026 1.000000 0.025548
# NN 0.120102 0.122905 0.025548 1.000000
However, if we equate Class shares (such as 50% / 50% in both data frames, or any shares that match across the two data frames), the outputs agree exactly.
np.random.seed(2292020)
dfBDT = pd.DataFrame({'EventNumber': np.random.randint(1, 15, 500),
'Class': np.concatenate((np.zeros(250), np.ones(250))),
'score': np.random.randn(500)
})
dfNN = pd.DataFrame({'EventNumber': np.random.randint(1, 15, 500),
'Class': np.concatenate((np.zeros(250), np.ones(250))),
'score': np.random.randn(500)
})
...
dfTotal.corr()
# EventNumberBDT EventNumberNN BDT NN
# EventNumberBDT 1.000000 0.992846 -0.026130 0.023623
# EventNumberNN 0.992846 1.000000 -0.023411 0.022093
# BDT -0.026130 -0.023411 1.000000 -0.026454
# NN 0.023623 0.022093 -0.026454 1.000000
dfSub.corr()
# EventNumberBDT EventNumberNN BDT NN
# EventNumberBDT 1.000000 0.992846 -0.026130 0.023623
# EventNumberNN 0.992846 1.000000 -0.023411 0.022093
# BDT -0.026130 -0.023411 1.000000 -0.026454
# NN 0.023623 0.022093 -0.026454 1.000000
Finally, this has been tested with OP's original code:
def op_approach_total():
dfscore = pd.concat([dfBDT['score'],dfNN['score']], axis = 1)
dfnum = pd.concat([dfBDT['EventNumber'],dfNN['EventNumber']], axis = 1)
dfTotal = pd.concat([dfnum,dfscore], axis = 1)
dfTotal.columns = ['EventNumberBDT', 'EventNumberNN', 'BDT', 'NN']
return dfTotal.corr()
def op_approach_split():
    # BACKGROUND EVENTS (Class == 0)
BDT_back = (dfBDT.loc[dfBDT['Class'] == 0])['score']
BDT_back.reset_index(drop=True, inplace=True)
BDT_back_num = (dfBDT.loc[dfBDT['Class'] == 0])['EventNumber']
BDT_back_num.reset_index(drop=True, inplace=True)
NN_back = (dfNN.loc[dfNN['Class'] == 0])['score']
NN_back.reset_index(drop=True, inplace=True)
NN_back_num = (dfNN.loc[dfNN['Class'] == 0])['EventNumber']
NN_back_num.reset_index(drop=True, inplace=True)
dfBack = pd.concat([BDT_back_num,NN_back_num,BDT_back,NN_back],
axis = 1)
dfBack.reset_index(drop=True, inplace=True)
dfBack.columns = ['EventNumberBDT','EventNumberNN','BDT','NN']
    # SIGNAL EVENTS (Class == 1)
BDT_sig = (dfBDT.loc[dfBDT['Class'] == 1])['score']
BDT_sig.reset_index(drop=True, inplace=True)
BDT_sig_num = (dfBDT.loc[dfBDT['Class'] == 1])['EventNumber']
BDT_sig_num.reset_index(drop=True, inplace=True)
NN_sig = (dfNN.loc[dfNN['Class'] == 1])['score']
NN_sig.reset_index(drop=True, inplace=True)
NN_sig_num = (dfNN.loc[dfNN['Class'] == 1])['EventNumber']
NN_sig_num.reset_index(drop=True, inplace=True)
dfSig = pd.concat([BDT_sig_num, NN_sig_num, BDT_sig, NN_sig],
axis = 1)
dfSig.reset_index(drop=True, inplace=True)
dfSig.columns = ['EventNumberBDT','EventNumberNN','BDT','NN']
# ADDING EventNumber COLUMNS
ev_back = pd.concat([dfBack['EventNumberBDT'], dfSig['EventNumberBDT']])
ev_sig = pd.concat([dfBack['EventNumberNN'], dfSig['EventNumberNN']])
ab = pd.concat([dfBack['BDT'], dfSig['BDT']])
ba = pd.concat([dfBack['NN'], dfSig['NN']])
# HORIZONTAL MERGE
abba = pd.concat([ev_back, ev_sig, ab, ba], axis = 1)
return abba.corr()
Output
opTotal = op_approach_total()
opTotal
# EventNumberBDT EventNumberNN BDT NN
# EventNumberBDT 1.000000 0.992846 -0.026130 0.023623
# EventNumberNN 0.992846 1.000000 -0.023411 0.022093
# BDT -0.026130 -0.023411 1.000000 -0.026454
# NN 0.023623 0.022093 -0.026454 1.000000
opSub = op_approach_split()
opSub
# EventNumberBDT EventNumberNN BDT NN
# EventNumberBDT 1.000000 0.992846 -0.026130 0.023623
# EventNumberNN 0.992846 1.000000 -0.023411 0.022093
# BDT -0.026130 -0.023411 1.000000 -0.026454
# NN 0.023623 0.022093 -0.026454 1.000000
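If the goal is to guarantee that the same event is paired on both sides regardless of sort order or class shares, one option beyond the above is to merge explicitly on the key column instead of relying on index alignment. A minimal sketch with stand-in frames (assuming, as in OP's data, that EventNumber uniquely identifies an event in each frame):

```python
import pandas as pd

# Hypothetical stand-ins for dfBDT / dfNN, deliberately in different orders.
dfBDT = pd.DataFrame({'EventNumber': [3, 1, 2], 'Class': [0, 1, 0],
                      'score': [0.2, 0.9, 0.4]})
dfNN = pd.DataFrame({'EventNumber': [1, 2, 3], 'Class': [1, 0, 0],
                     'score': [0.8, 0.5, 0.3]})

# merge pairs rows by key, not by position, so the result is invariant
# to sort order; validate='one_to_one' raises if EventNumber ever repeats.
merged = dfBDT.merge(dfNN, on='EventNumber', suffixes=('_BDT', '_NN'),
                     validate='one_to_one')
corr = merged[['score_BDT', 'score_NN']].corr()
```

With key-based pairing, filtering by Class and then recombining can no longer shuffle which scores are compared against which.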
Upvotes: 1