ysearka

Reputation: 3855

How to compute aggregations on a range of thresholds?

Let's say I have the following dataframe:

import pandas as pd
import numpy as np

df = pd.DataFrame([[0.1,0],[0.2,1],[0.3,1],[0.4,0]], columns = ['score', 'correct_pred'])

     score  correct_pred
0    0.1             0
1    0.2             1
2    0.3             1
3    0.4             0

And for each row I would like to compute the proportion of rows with a score below it, and the proportion of correct_pred among the rows with a score equal to or above it.

That is, for the second row for instance, 25% of the rows have a score below 0.2, and 66% of the rows with a score equal to or above 0.2 have a correct prediction. The output would then look like:

threshold    percentage_filtered    percentage_correct_pred
0.1          0                      0.5
0.2          0.25                   0.66
0.3          0.5                    0.5
0.4          0.75                   0
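
For example, the numbers in the 0.2 row can be checked directly (a quick hand verification on the dataframe above, not a general solution):

below = df.score < 0.2
print(below.mean())                           # 0.25  -> percentage_filtered
print(df.loc[~below, 'correct_pred'].mean())  # 0.666... -> percentage_correct_pred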

So far I do it using this piece of code:

out = pd.DataFrame(columns = ['threshold', 'percentage_filtered', 'percentage_correct_pred'])
for threshold in df.score:
    threshold_mask = df.score < threshold
    out.loc[len(out)] = [threshold, 
                         np.mean(threshold_mask), 
                         df[~threshold_mask].correct_pred.mean()]

This works, but it is terribly slow on a real-sized dataframe, so I need a faster version. I suspect there is a more vectorized method, maybe using numpy.cumsum or something similar?

Upvotes: 2

Views: 191

Answers (2)

Nikolas Rieble

Reputation: 2621

Let's improve the runtime by a factor of 10.

For reference:

df = pd.DataFrame([[0.1,0],[0.2,1],[0.3,1],[0.4,0]], columns = ['score', 'correct_pred'])

def do_it():
    out = pd.DataFrame(columns = ['threshold', 'percentage_filtered', 'percentage_correct_pred'])
    for threshold in df.score:
        threshold_mask = df.score < threshold
        out.loc[len(out)] = [threshold, 
                             np.mean(threshold_mask), 
                             df[~threshold_mask].correct_pred.mean()]

%timeit do_it()

13 ms ± 607 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

First, we move all calls to pandas methods out of the loop:

def do_it1():

    score_values = df.score.values
    score_list = list(set(score_values))

    correct_pred = df.correct_pred.values

    out = pd.DataFrame(columns = ['threshold', 'percentage_filtered', 'percentage_correct_pred'])

    for threshold in score_list:
        mask = score_values < threshold
        out.loc[len(out)] = [threshold, 
                             np.mean(mask), 
                             np.mean(correct_pred[~mask])]

%timeit do_it1()

9.67 ms ± 331 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Then we also build the dataframe only after collecting all the results:

def do_it2():

    score_values = df.score.values
    score_list = list(set(score_values))

    correct_pred = df.correct_pred.values

    result = []
    for threshold in score_list:
        mask = score_values < threshold
        result.append((threshold,np.mean(mask),np.mean(correct_pred[~mask])))

    out = pd.DataFrame(result, columns = ['threshold', 'percentage_filtered', 'percentage_correct_pred'])

%timeit do_it2()

960 µs ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

EDIT: To take jdehesa's answer into account:

df = pd.DataFrame([[0.1, 0], [0.2, 1], [0.3, 1], [0.4, 0]],
                  columns=['score', 'correct_pred'])
def do_it_jdehesa():
    # Group by scores and count occurrences and number of correct predictions
    df2 = (df.sort_values('score')
           .groupby('score')['correct_pred']
           .agg(['count', 'sum'])
           .reset_index())
    # Percentage of values below each threshold
    perc_filtered = df2['count'].shift(1).fillna(0).cumsum() / df2['count'].sum()
    # Percentage of values above each threshold with correct prediction
    perc_correct_pred = df2['sum'][::-1].cumsum()[::-1] / df2['count'][::-1].cumsum()[::-1]
    # Assemble result
    result = pd.concat([df2['score'], perc_filtered, perc_correct_pred], axis=1)
    result.columns = ['threshold', 'percentage_filtered', 'percentage_correct_pred']

%timeit do_it_jdehesa()

13.5 ms ± 997 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

EDIT 2: Optimizing the function a little more, though still nowhere near as fast as jdehesa's answer:

def do_it5():

    dfarray = df.values
    n = dfarray.shape[0]  # number of rows (dfarray.size would count every cell)
    score_values = dfarray[:, 0]
    score_list = np.unique(score_values)
    correct_pred = dfarray[:, 1]

    result = []
    for threshold in score_list:
        mask = score_values < threshold
        kept = ~mask  # rows with score >= threshold
        result.append((threshold,
                       np.count_nonzero(mask) / n,
                       np.count_nonzero(correct_pred[kept]) / np.count_nonzero(kept)))

    result = pd.DataFrame(result, columns=['threshold', 'percentage_filtered', 'percentage_correct_pred'])

Upvotes: 1

javidcf

Reputation: 59731

I will assume that score may have repeated values; if it does not, this would also work (although it could be simpler). This is one way to get that result:

import pandas as pd
import numpy as np

df = pd.DataFrame([[0.1, 0], [0.2, 1], [0.3, 1], [0.4, 0]],
                  columns=['score', 'correct_pred'])
# Group by scores and count occurrences and number of correct predictions
df2 = (df.sort_values('score')
       .groupby('score')['correct_pred']
       .agg(['count', 'sum'])
       .reset_index())
# Percentage of values below each threshold
perc_filtered = df2['count'].shift(1).fillna(0).cumsum() / df2['count'].sum()
# Percentage of values above each threshold with correct prediction
perc_correct_pred = df2['sum'][::-1].cumsum()[::-1] / df2['count'][::-1].cumsum()[::-1]
# Assemble result
result = pd.concat([df2['score'], perc_filtered, perc_correct_pred], axis=1)
result.columns = ['threshold', 'percentage_filtered', 'percentage_correct_pred']
print(result)
#    threshold  percentage_filtered  percentage_correct_pred
# 0        0.1                 0.00                 0.500000
# 1        0.2                 0.25                 0.666667
# 2        0.3                 0.50                 0.500000
# 3        0.4                 0.75                 0.000000
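
The [::-1].cumsum()[::-1] pattern above computes suffix (reverse cumulative) sums: for each score it yields the total over that score and every higher one. A minimal sketch of just that trick, using counts and correct as stand-ins for df2['count'] and df2['sum'] from the toy example:

import pandas as pd

counts = pd.Series([1, 1, 1, 1])    # rows per unique score (toy example)
correct = pd.Series([0, 1, 1, 0])   # correct predictions per unique score

# suffix sums: position i holds the sum over positions i..end
rows_at_or_above = counts[::-1].cumsum()[::-1]       # 4, 3, 2, 1
correct_at_or_above = correct[::-1].cumsum()[::-1]   # 2, 2, 1, 0

print((correct_at_or_above / rows_at_or_above).tolist())
# -> 0.5, 0.666..., 0.5, 0.0, matching percentage_correct_pred above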

Performance:

np.random.seed(123)

df = pd.DataFrame({'score': np.arange(0, 1, 0.0005),
                   'correct_pred':np.random.choice([1,0], size=2000)
                   })

print (df)
       score  correct_pred
0     0.0000             1
1     0.0005             0
2     0.0010             1
3     0.0015             1
4     0.0020             1
     ...           ...
1995  0.9975             0
1996  0.9980             0
1997  0.9985             1
1998  0.9990             1
1999  0.9995             1

[2000 rows x 2 columns]

In [208]: %timeit do_it_jdehesa()
9.57 ms ± 317 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [209]: %timeit do_it()
5.83 s ± 181 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [210]: %timeit do_it1()
3.21 s ± 203 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [211]: %timeit do_it2()
92.5 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Upvotes: 3
