Reputation: 3855
Let's say I have the following dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame([[0.1, 0], [0.2, 1], [0.3, 1], [0.4, 0]], columns=['score', 'correct_pred'])
   score  correct_pred
0    0.1             0
1    0.2             1
2    0.3             1
3    0.4             0
And for each row I would like to compute the proportion of rows with a strictly lower score, and the proportion of correct predictions among rows with a score equal to or above it.
For the second row, for instance, 25% of the rows have a score below 0.2, and about 66% of the rows with a score equal to or above 0.2 have a correct prediction. The output would then look like:
threshold  percentage_filtered  percentage_correct_pred
      0.1                 0.00                     0.50
      0.2                 0.25                     0.66
      0.3                 0.50                     0.50
      0.4                 0.75                     0.00
So far I do it using this piece of code:
out = pd.DataFrame(columns=['threshold', 'percentage_filtered', 'percentage_correct_pred'])
for threshold in df.score:
    threshold_mask = df.score < threshold
    out.loc[len(out)] = [threshold,
                         np.mean(threshold_mask),
                         df[~threshold_mask].correct_pred.mean()]
This works, but it is terribly slow on a real-sized dataframe, so I need a faster version. I suspect there is a more vectorized method, maybe using numpy.cumsum or something?
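For concreteness, one shape such a cumsum-based version could take — a sketch of my own, under the assumption that all scores are unique:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0.1, 0], [0.2, 1], [0.3, 1], [0.4, 0]],
                  columns=['score', 'correct_pred'])

s = df.sort_values('score').reset_index(drop=True)
n = len(s)
# rows strictly below each threshold: with unique scores this is just the rank
percentage_filtered = np.arange(n) / n
# correct predictions at or above each threshold: a reversed cumulative sum
correct_at_or_above = s['correct_pred'][::-1].cumsum()[::-1].to_numpy()
percentage_correct_pred = correct_at_or_above / (n - np.arange(n))

out = pd.DataFrame({'threshold': s['score'],
                    'percentage_filtered': percentage_filtered,
                    'percentage_correct_pred': percentage_correct_pred})
print(out)
```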
Upvotes: 2
Views: 191
Reputation: 2621
Let's improve the runtime by a factor of 10.
For reference:
df = pd.DataFrame([[0.1,0],[0.2,1],[0.3,1],[0.4,0]], columns = ['score', 'correct_pred'])
def do_it():
    out = pd.DataFrame(columns=['threshold', 'percentage_filtered', 'percentage_correct_pred'])
    for threshold in df.score:
        threshold_mask = df.score < threshold
        out.loc[len(out)] = [threshold,
                             np.mean(threshold_mask),
                             df[~threshold_mask].correct_pred.mean()]
%timeit do_it()
13 ms ± 607 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
First, we move all calls to pandas methods out of the loop:
def do_it1():
    score_values = df.score.values
    score_list = list(set(score_values))
    correct_pred = df.correct_pred.values
    out = pd.DataFrame(columns=['threshold', 'percentage_filtered', 'percentage_correct_pred'])
    for threshold in score_list:
        mask = score_values < threshold
        out.loc[len(out)] = [threshold,
                             np.mean(mask),
                             np.mean(correct_pred[~mask])]
%timeit do_it1()
9.67 ms ± 331 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Then we also create the dataframe only after collecting all the results:
def do_it2():
    score_values = df.score.values
    score_list = list(set(score_values))
    correct_pred = df.correct_pred.values
    result = []
    for threshold in score_list:
        mask = score_values < threshold
        result.append((threshold, np.mean(mask), np.mean(correct_pred[~mask])))
    out = pd.DataFrame(result, columns=['threshold', 'percentage_filtered', 'percentage_correct_pred'])
%timeit do_it2()
960 µs ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
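A further sketch of my own (not part of the original answer): the per-threshold boolean mask can be removed entirely by sorting once and answering every threshold from np.searchsorted plus a reversed cumulative sum. This also handles repeated scores:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0.1, 0], [0.2, 1], [0.3, 1], [0.4, 0]],
                  columns=['score', 'correct_pred'])

order = np.argsort(df['score'].to_numpy())
scores = df['score'].to_numpy()[order]
correct = df['correct_pred'].to_numpy()[order]
n = len(scores)

thresholds = np.unique(scores)
# index of the first element >= threshold == number of rows strictly below it
below = np.searchsorted(scores, thresholds, side='left')
# suffix sums: correct_suffix[i] == correct[i:].sum()
correct_suffix = correct[::-1].cumsum()[::-1]
out = pd.DataFrame({'threshold': thresholds,
                    'percentage_filtered': below / n,
                    'percentage_correct_pred': correct_suffix[below] / (n - below)})
print(out)
```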
EDIT: To take jdehesa's answer into account:
df = pd.DataFrame([[0.1, 0], [0.2, 1], [0.3, 1], [0.4, 0]],
                  columns=['score', 'correct_pred'])

def do_it_jdehesa():
    # Group by scores and count occurrences and number of correct predictions
    df2 = (df.sort_values('score')
             .groupby('score')['correct_pred']
             .agg(['count', 'sum'])
             .reset_index())
    # Percentage of values below each threshold
    perc_filtered = df2['count'].shift(1).fillna(0).cumsum() / df2['count'].sum()
    # Percentage of values above each threshold with correct prediction
    perc_correct_pred = df2['sum'][::-1].cumsum()[::-1] / df2['count'][::-1].cumsum()[::-1]
    # Assemble result
    result = pd.concat([df2['score'], perc_filtered, perc_correct_pred], axis=1)
    result.columns = ['threshold', 'percentage_filtered', 'percentage_correct_pred']
%timeit do_it_jdehesa()
13.5 ms ± 997 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
EDIT2: Just optimizing the function a little more, yet nowhere near as fast as jdehesa's answer:
def do_it5():
    dfarray = df.values
    n = len(dfarray)  # number of rows (dfarray.size would count all elements)
    score_values = dfarray[:, 0]
    score_list = np.unique(score_values)
    correct_pred = dfarray[:, 1]
    result = []
    for threshold in score_list:
        mask = score_values < threshold
        kept = ~mask
        result.append((threshold,
                       np.count_nonzero(mask) / n,
                       np.count_nonzero(correct_pred[kept]) / np.count_nonzero(kept)))
    result = pd.DataFrame(result, columns=['threshold', 'percentage_filtered', 'percentage_correct_pred'])
Upvotes: 1
Reputation: 59731
I will assume that score may have repeated values, but if it does not it will also work (although it could be simpler). This is a way to get that result:
import pandas as pd
import numpy as np
df = pd.DataFrame([[0.1, 0], [0.2, 1], [0.3, 1], [0.4, 0]],
columns=['score', 'correct_pred'])
# Group by scores and count occurrences and number of correct predictions
df2 = (df.sort_values('score')
         .groupby('score')['correct_pred']
         .agg(['count', 'sum'])
         .reset_index())
# Percentage of values below each threshold
perc_filtered = df2['count'].shift(1).fillna(0).cumsum() / df2['count'].sum()
# Percentage of values above each threshold with correct prediction
perc_correct_pred = df2['sum'][::-1].cumsum()[::-1] / df2['count'][::-1].cumsum()[::-1]
# Assemble result
result = pd.concat([df2['score'], perc_filtered, perc_correct_pred], axis=1)
result.columns = ['threshold', 'percentage_filtered', 'percentage_correct_pred']
print(result)
#    threshold  percentage_filtered  percentage_correct_pred
# 0        0.1                 0.00                 0.500000
# 1        0.2                 0.25                 0.666667
# 2        0.3                 0.50                 0.500000
# 3        0.4                 0.75                 0.000000
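The `[::-1].cumsum()[::-1]` idiom used for perc_correct_pred computes a suffix (reverse) cumulative sum: element i becomes the sum of elements i through the end. A tiny illustration on the example's correct_pred column:

```python
import pandas as pd

# suffix cumulative sum: reverse, accumulate, reverse back
s = pd.Series([0, 1, 1, 0])
suffix = s[::-1].cumsum()[::-1]
print(suffix.tolist())  # [2, 2, 1, 0]
```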
Performance:
np.random.seed(123)
df = pd.DataFrame({'score': np.arange(0, 1, 0.0005),
                   'correct_pred': np.random.choice([1, 0], size=2000)})
print (df)
       score  correct_pred
0     0.0000             1
1     0.0005             0
2     0.0010             1
3     0.0015             1
4     0.0020             1
...      ...           ...
1995  0.9975             0
1996  0.9980             0
1997  0.9985             1
1998  0.9990             1
1999  0.9995             1

[2000 rows x 2 columns]
In [208]: %timeit do_it_jdehesa()
9.57 ms ± 317 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [209]: %timeit do_it()
5.83 s ± 181 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [210]: %timeit do_it1()
3.21 s ± 203 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [211]: %timeit do_it2()
92.5 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Upvotes: 3