pythonthrowaway

Reputation: 99

Correlation coefficient of total population is greater than population samples

I have created two classifiers, a Boosted Decision Tree (BDT), and a Neural Network (NN) to classify events as either belonging to a signal class or background class. They output a continuous probability between 0 and 1 of belonging to the signal class. I want to compare the two methods and wish to find the correlation between the two.

However, I find that if I calculate the correlation coefficient of just the background-class events, or just the signal-class events, each of these correlations is smaller than the correlation of the entire dataset. I would have assumed that, as both classifiers are tested on exactly the same dataset, the total correlation would be a weighted average of the two individual correlations. Note that the total dataset consists of ~100,000 events.
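As an aside, the pooled Pearson correlation is generally not a weighted average of the per-class correlations: when the two classes cluster at opposite ends of the score range, the separation between the class means alone can push the pooled correlation well above either within-class value. A minimal sketch with synthetic scores (the cluster centres and spreads here are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two groups whose scores are uncorrelated *within* each group,
# but whose group means sit at opposite ends of the [0, 1] range.
back = 0.2 + 0.05 * rng.standard_normal((1000, 2))  # "background" cluster
sig = 0.8 + 0.05 * rng.standard_normal((1000, 2))   # "signal" cluster

r_back = np.corrcoef(back[:, 0], back[:, 1])[0, 1]  # within-group, ~0
r_sig = np.corrcoef(sig[:, 0], sig[:, 1])[0, 1]     # within-group, ~0

pooled = np.vstack([back, sig])
r_total = np.corrcoef(pooled[:, 0], pooled[:, 1])[0, 1]  # close to 1

print(r_back, r_sig, r_total)
```

So a pooled correlation above both subgroup correlations is not by itself evidence of a bug, though the index-alignment issue discussed in the answer below is a separate, real problem.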

Here I calculate the correlation for the whole dataset using the pandas .corr() function which calculates the Pearson correlation matrix:

dfBDT = pd.read_csv("BDTResults.csv")
dfNN = pd.read_csv("NNResults.csv")

# not sorted by EventNumber by default
dfBDT = dfBDT.sort_values('EventNumber')
dfNN = dfNN.sort_values('EventNumber')

# Resets index of sorted dataframe so sorted dataframe index begins at 0
dfBDT.reset_index(drop=True, inplace=True)
dfNN.reset_index(drop=True, inplace=True)

dfscore = pd.concat([dfBDT['score'],dfNN['score']], axis = 1)
dfnum = pd.concat([dfBDT['EventNumber'],dfNN['EventNumber']], axis = 1)

dfTotal = pd.concat([dfnum,dfscore], axis = 1)
dfTotal.columns = ['EventNumberBDT', 'EventNumberNN', 'BDT', 'NN']

dfTotal.corr()

This gives a 97% correlation. I then do the same just for the background events, which I have defined to have a Class of 0:

BDT_back = (dfBDT.loc[dfBDT['Class'] == 0])['score']
BDT_back.reset_index(drop=True, inplace=True)

BDT_back_num = (dfBDT.loc[dfBDT['Class'] == 0])['EventNumber']
BDT_back_num.reset_index(drop=True, inplace=True)


NN_back = (dfNN.loc[dfNN['Class'] == 0])['score']
NN_back.reset_index(drop=True, inplace=True)

NN_back_num = (dfNN.loc[dfNN['Class'] == 0])['EventNumber']
NN_back_num.reset_index(drop=True, inplace=True)



dfBack = pd.concat([BDT_back_num,NN_back_num,BDT_back,NN_back],
                   axis = 1)
dfBack.reset_index(drop=True, inplace=True)

dfBack.columns = ['EventNumberBDT','EventNumberNN','BDT','NN']

dfBack.corr()

This gives me a correlation of about 96%. I then repeat the above for the signal events, i.e. replace Class = 0 with Class = 1, and get a correlation of 91%.

Then if I rejoin the two dataframes and calculate the total correlation again, I get a higher correlation than before, 98%:

ab = pd.concat([dfBack['BDT'],dfSig['BDT']])
ba = pd.concat([dfBack['NN'],dfSig['NN']])

abba = pd.concat([ab,ba], axis = 1)
abba.corr()

The fact that these two values are different must mean that something is going wrong but I do not know where.

Upvotes: 1

Views: 250

Answers (1)

Parfait

Reputation: 107587

Ultimately, it comes down to the horizontal merges, which align on the index.

Unmatched Rows

If the row counts of the two data frames differ, pd.concat, which defaults to an outer join, will generate NaN at the unmatched indexes of the shorter data frame, leaving the result with more rows than that data frame had before splitting.
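A tiny sketch of this alignment behavior (toy Series, not OP's data):

```python
import pandas as pd

s1 = pd.Series([0.1, 0.2, 0.3])   # 3 rows, index 0..2
s2 = pd.Series([0.4, 0.5])        # 2 rows, index 0..1

out = pd.concat([s1, s2], axis=1)  # outer join on index by default
print(out)
#      0    1
# 0  0.1  0.4
# 1  0.2  0.5
# 2  0.3  NaN
```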

Unmatched Classes

Additionally, if Class has different percentage shares between the two data frames, dfBDT and dfNN, the corresponding joins will return NaN at the unmatched indexes.

For example, let's say dfBDT splits 60% / 40% between Class 0 and 1 while dfNN splits 50% / 50%. Then:

  • BDT Class 0 will have more rows than NN Class 0
  • BDT Class 1 will have fewer rows than NN Class 1

After a horizontal join with pd.concat(..., axis = 1), which defaults to an outer join (how = 'outer'), the resulting mismatches generate NaN on both sides. Even with how = 'inner' you would be filtering out the mismatches, whereas dfTotal never filters out any rows.
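For instance, with a toy 60/40 vs 40/60 split (made-up numbers), filtering each frame by Class and resetting the index leaves the shorter side padded with NaN:

```python
import pandas as pd

# dfBDT-like frame: 3 background rows, 2 signal rows (60% / 40%)
dfA = pd.DataFrame({'Class': [0, 0, 0, 1, 1],
                    'score': [0.1, 0.2, 0.3, 0.8, 0.9]})
# dfNN-like frame: 2 background rows, 3 signal rows (40% / 60%)
dfB = pd.DataFrame({'Class': [0, 0, 1, 1, 1],
                    'score': [0.15, 0.25, 0.7, 0.8, 0.9]})

a0 = dfA.loc[dfA['Class'] == 0, 'score'].reset_index(drop=True)  # 3 rows
b0 = dfB.loc[dfB['Class'] == 0, 'score'].reset_index(drop=True)  # 2 rows

back = pd.concat([a0, b0], axis=1)  # outer join pads b0 with NaN
print(back)
```

Note that .corr() then silently excludes the NaN pairs, so the subgroup correlation is computed on fewer row pairs than you might expect.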

Sort Order

Testing on both Linux and Windows machines with a seeded, reproducible example indicates that sort order matters: specifically, sorting by Class first and then EventNumber.
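A sketch of why a common sort order matters for this kind of positional pairing (toy frames; merging on EventNumber with pd.merge would sidestep the issue entirely):

```python
import pandas as pd

df1 = pd.DataFrame({'EventNumber': [2, 1], 'score': [0.3, 0.9]})
df2 = pd.DataFrame({'EventNumber': [1, 2], 'score': [0.8, 0.2]})

# Concatenating positionally without a common sort pairs
# event 2's score (0.3) with event 1's score (0.8):
bad = pd.concat([df1['score'].reset_index(drop=True),
                 df2['score'].reset_index(drop=True)], axis=1)

# Sorting both frames the same way first pairs like with like:
s1 = df1.sort_values('EventNumber').reset_index(drop=True)
s2 = df2.sort_values('EventNumber').reset_index(drop=True)
good = pd.concat([s1['score'], s2['score']], axis=1)
# good row 0 is (0.9, 0.8): event 1 paired with event 1
# good row 1 is (0.3, 0.2): event 2 paired with event 2
```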


This can be demonstrated with seeded, random data for a reproducible example. Below, your code is refactored to avoid the many pd.concat calls by using DataFrame.join, which likewise aligns on the index. Further down, the same test is run with code equivalent to OP's original setup.

Data

import numpy as np
import pandas as pd

np.random.seed(2292020)
dfBDT = pd.DataFrame({'EventNumber': np.random.randint(1, 15, 500),
                      'Class': np.random.randint(0, 2, 500),
                      'score': np.random.randn(500)
                     })


dfNN = pd.DataFrame({'EventNumber': np.random.randint(1, 15, 500),
                     'Class': np.random.randint(0, 2, 500),
                     'score': np.random.randn(500)
                    })

Code

dfBDT = dfBDT.sort_values(['Class', 'EventNumber']).reset_index(drop=True)    
dfNN = dfNN.sort_values(['Class', 'EventNumber']).reset_index(drop=True)  

# ALL ROWS (NO FILTER)
dfTotal = (dfBDT.reindex(['EventNumber', 'score'], axis='columns')
                .join(dfNN.reindex(['EventNumber', 'score'], axis='columns'),
                      rsuffix = '_')
                .set_axis(['EventNumberBDT', 'BDT', 'EventNumberNN', 'NN'], 
                          axis='columns', inplace = False)
                .reindex(['EventNumberBDT','EventNumberNN','BDT','NN'], 
                         axis='columns'))    
dfTotal.corr()

# TWO FILTERED DATA FRAMES CLASS (0 FOR BACKGROUND, 1 FOR SIGNAL)
df_list = [(dfBDT.query('Class == {}'.format(i))
                 .reindex(['EventNumber', 'score'], axis='columns')
                 .join(dfNN.query('Class == {}'.format(i))
                           .reindex(['EventNumber', 'score'], axis='columns'),
                       rsuffix = '_')
                 .set_axis(['EventNumberBDT', 'BDT', 'EventNumberNN', 'NN'],
                           axis='columns', inplace = False)

                 .reindex(['EventNumberBDT','EventNumberNN','BDT','NN'],
                          axis='columns')
           ) for i in range(0,2)]

dfSub = pd.concat(df_list)

dfSub.corr()

Output (notice they return different results)

dfTotal.corr()
#                 EventNumberBDT  EventNumberNN       BDT        NN
# EventNumberBDT        1.000000       0.912279 -0.024121  0.115754
# EventNumberNN         0.912279       1.000000 -0.039038  0.122905
# BDT                  -0.024121      -0.039038  1.000000  0.012143
# NN                    0.115754       0.122905  0.012143  1.000000

dfSub.corr()
#                 EventNumberBDT  EventNumberNN       BDT        NN
# EventNumberBDT        1.000000       0.974140 -0.024121  0.120102
# EventNumberNN         0.974140       1.000000 -0.026026  0.122905
# BDT                  -0.024121      -0.026026  1.000000  0.025548
# NN                    0.120102       0.122905  0.025548  1.000000

However, if we equate the Class shares (for example 50% / 50% in both data frames, or any split, as long as it is identical in both), the outputs match exactly.

np.random.seed(2292020)
dfBDT = pd.DataFrame({'EventNumber': np.random.randint(1, 15, 500),
                      'Class': np.concatenate((np.zeros(250), np.ones(250))),
                      'score': np.random.randn(500)
                     })


dfNN = pd.DataFrame({'EventNumber': np.random.randint(1, 15, 500),
                     'Class': np.concatenate((np.zeros(250), np.ones(250))),
                     'score': np.random.randn(500)
                    })

...

dfTotal.corr()
#                 EventNumberBDT  EventNumberNN       BDT        NN
# EventNumberBDT        1.000000       0.992846 -0.026130  0.023623
# EventNumberNN         0.992846       1.000000 -0.023411  0.022093
# BDT                  -0.026130      -0.023411  1.000000 -0.026454
# NN                    0.023623       0.022093 -0.026454  1.000000


dfSub.corr()
#                 EventNumberBDT  EventNumberNN       BDT        NN
# EventNumberBDT        1.000000       0.992846 -0.026130  0.023623
# EventNumberNN         0.992846       1.000000 -0.023411  0.022093
# BDT                  -0.026130      -0.023411  1.000000 -0.026454
# NN                    0.023623       0.022093 -0.026454  1.000000

Finally, this has been tested with OP's original code:

def op_approach_total():
    dfscore = pd.concat([dfBDT['score'],dfNN['score']], axis = 1)
    dfnum = pd.concat([dfBDT['EventNumber'],dfNN['EventNumber']], axis = 1)

    dfTotal = pd.concat([dfnum,dfscore], axis = 1)
    dfTotal.columns = ['EventNumberBDT', 'EventNumberNN', 'BDT', 'NN']

    return dfTotal.corr()


def op_approach_split():
    # BACKGROUND ROWS (Class == 0)
    BDT_back = (dfBDT.loc[dfBDT['Class'] == 0])['score']
    BDT_back.reset_index(drop=True, inplace=True)

    BDT_back_num = (dfBDT.loc[dfBDT['Class'] == 0])['EventNumber']
    BDT_back_num.reset_index(drop=True, inplace=True)


    NN_back = (dfNN.loc[dfNN['Class'] == 0])['score']
    NN_back.reset_index(drop=True, inplace=True)

    NN_back_num = (dfNN.loc[dfNN['Class'] == 0])['EventNumber'] 
    NN_back_num.reset_index(drop=True, inplace=True)


    dfBack = pd.concat([BDT_back_num,NN_back_num,BDT_back,NN_back],
                       axis = 1)
    dfBack.reset_index(drop=True, inplace=True)
    dfBack.columns = ['EventNumberBDT','EventNumberNN','BDT','NN']


    # SIGNAL ROWS (Class == 1)
    BDT_sig = (dfBDT.loc[dfBDT['Class'] == 1])['score']
    BDT_sig.reset_index(drop=True, inplace=True)

    BDT_sig_num = (dfBDT.loc[dfBDT['Class'] == 1])['EventNumber']
    BDT_sig_num.reset_index(drop=True, inplace=True)

    NN_sig = (dfNN.loc[dfNN['Class'] == 1])['score']
    NN_sig.reset_index(drop=True, inplace=True)

    NN_sig_num = (dfNN.loc[dfNN['Class'] == 1])['EventNumber']
    NN_sig_num.reset_index(drop=True, inplace=True)


    dfSig = pd.concat([BDT_sig_num, NN_sig_num, BDT_sig, NN_sig],
                       axis = 1)
    dfSig.reset_index(drop=True, inplace=True)
    dfSig.columns = ['EventNumberBDT','EventNumberNN','BDT','NN']

    # ADDING EventNumber COLUMNS
    ev_back = pd.concat([dfBack['EventNumberBDT'], dfSig['EventNumberBDT']])
    ev_sig = pd.concat([dfBack['EventNumberNN'], dfSig['EventNumberNN']])


    ab = pd.concat([dfBack['BDT'], dfSig['BDT']])

    ba = pd.concat([dfBack['NN'], dfSig['NN']])

    # HORIZONTAL MERGE
    abba = pd.concat([ev_back, ev_sig, ab, ba], axis = 1)

    return abba.corr()

opTotal = op_approach_total()
opSub = op_approach_split()

Output

opTotal = op_approach_total()
opTotal
#                 EventNumberBDT  EventNumberNN       BDT        NN
# EventNumberBDT        1.000000       0.992846 -0.026130  0.023623
# EventNumberNN         0.992846       1.000000 -0.023411  0.022093
# BDT                  -0.026130      -0.023411  1.000000 -0.026454
# NN                    0.023623       0.022093 -0.026454  1.000000

opSub = op_approach_split()
opSub
#                 EventNumberBDT  EventNumberNN       BDT        NN
# EventNumberBDT        1.000000       0.992846 -0.026130  0.023623
# EventNumberNN         0.992846       1.000000 -0.023411  0.022093
# BDT                  -0.026130      -0.023411  1.000000 -0.026454
# NN                    0.023623       0.022093 -0.026454  1.000000

Upvotes: 1
