Reputation: 3855
Let's say I have the following dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame([[0.1, 0], [0.2, 1], [0.3, 1], [0.4, 0]], columns=['score', 'correct_pred'])
   score  correct_pred
0    0.1             0
1    0.2             1
2    0.3             1
3    0.4             0
And for each row I would like to compute the proportion of rows with a strictly lower score, and the proportion of correct predictions among rows with a score equal to or above it.
For the second row, for instance, 25% of the rows have a score below 0.2, and about 66% of the rows with a score equal to or above 0.2 have a correct prediction. The output would then look like:
threshold  percentage_filtered  percentage_correct_pred
      0.1                 0.00                     0.50
      0.2                 0.25                     0.66
      0.3                 0.50                     0.50
      0.4                 0.75                     0.00
So far I do it using this piece of code:
out = pd.DataFrame(columns=['threshold', 'percentage_filtered', 'percentage_correct_pred'])
for threshold in df.score:
    threshold_mask = df.score < threshold
    out.loc[len(out)] = [threshold,
                         np.mean(threshold_mask),
                         df[~threshold_mask].correct_pred.mean()]
This works, but it is terribly slow on a real-sized dataframe, so I need a faster version. I suspect there is a more vectorized method, maybe using numpy.cumsum or something?
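For concreteness, one shape such a cumsum-based version could take — a sketch of my own, under the assumption that all scores are unique:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0.1, 0], [0.2, 1], [0.3, 1], [0.4, 0]],
                  columns=['score', 'correct_pred'])

s = df.sort_values('score').reset_index(drop=True)
n = len(s)
# rows strictly below each threshold: with unique scores this is just the rank
percentage_filtered = np.arange(n) / n
# correct predictions at or above each threshold: a reversed cumulative sum
correct_at_or_above = s['correct_pred'][::-1].cumsum()[::-1].to_numpy()
percentage_correct_pred = correct_at_or_above / (n - np.arange(n))

out = pd.DataFrame({'threshold': s['score'],
                    'percentage_filtered': percentage_filtered,
                    'percentage_correct_pred': percentage_correct_pred})
print(out)
```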
Upvotes: 2
Views: 191
Reputation: 2621
Let's improve the runtime by a factor of 10.
For reference:
df = pd.DataFrame([[0.1,0],[0.2,1],[0.3,1],[0.4,0]], columns = ['score', 'correct_pred'])
def do_it():
    out = pd.DataFrame(columns=['threshold', 'percentage_filtered', 'percentage_correct_pred'])
    for threshold in df.score:
        threshold_mask = df.score < threshold
        out.loc[len(out)] = [threshold,
                             np.mean(threshold_mask),
                             df[~threshold_mask].correct_pred.mean()]
%timeit do_it()
13 ms ± 607 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
First, we move all calls to pandas methods out of the loop:
def do_it1():
    score_values = df.score.values
    score_list = list(set(score_values))
    correct_pred = df.correct_pred.values
    out = pd.DataFrame(columns=['threshold', 'percentage_filtered', 'percentage_correct_pred'])
    for threshold in score_list:
        mask = score_values < threshold
        out.loc[len(out)] = [threshold,
                             np.mean(mask),
                             np.mean(correct_pred[~mask])]
%timeit do_it1()
9.67 ms ± 331 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Then we also create the dataframe only after collecting all the results:
def do_it2():
    score_values = df.score.values
    score_list = list(set(score_values))
    correct_pred = df.correct_pred.values
    result = []
    for threshold in score_list:
        mask = score_values < threshold
        result.append((threshold, np.mean(mask), np.mean(correct_pred[~mask])))
    out = pd.DataFrame(result, columns=['threshold', 'percentage_filtered', 'percentage_correct_pred'])
%timeit do_it2()
960 µs ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
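A further sketch of my own (not part of the original answer): the per-threshold boolean mask can be removed entirely by sorting once and answering every threshold from np.searchsorted plus a reversed cumulative sum. This also handles repeated scores:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0.1, 0], [0.2, 1], [0.3, 1], [0.4, 0]],
                  columns=['score', 'correct_pred'])

order = np.argsort(df['score'].to_numpy())
scores = df['score'].to_numpy()[order]
correct = df['correct_pred'].to_numpy()[order]
n = len(scores)

thresholds = np.unique(scores)
# index of the first element >= threshold == number of rows strictly below it
below = np.searchsorted(scores, thresholds, side='left')
# suffix sums: correct_suffix[i] == correct[i:].sum()
correct_suffix = correct[::-1].cumsum()[::-1]
out = pd.DataFrame({'threshold': thresholds,
                    'percentage_filtered': below / n,
                    'percentage_correct_pred': correct_suffix[below] / (n - below)})
print(out)
```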
EDIT: To take jdehesa's answer into account:
df = pd.DataFrame([[0.1, 0], [0.2, 1], [0.3, 1], [0.4, 0]],
                  columns=['score', 'correct_pred'])

def do_it_jdehesa():
    # Group by scores and count occurrences and number of correct predictions
    df2 = (df.sort_values('score')
             .groupby('score')['correct_pred']
             .agg(['count', 'sum'])
             .reset_index())
    # Percentage of values below each threshold
    perc_filtered = df2['count'].shift(1).fillna(0).cumsum() / df2['count'].sum()
    # Percentage of values above each threshold with correct prediction
    perc_correct_pred = df2['sum'][::-1].cumsum()[::-1] / df2['count'][::-1].cumsum()[::-1]
    # Assemble result
    result = pd.concat([df2['score'], perc_filtered, perc_correct_pred], axis=1)
    result.columns = ['threshold', 'percentage_filtered', 'percentage_correct_pred']
%timeit do_it_jdehesa()
13.5 ms ± 997 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
EDIT2: Just optimizing the function a little more, yet nowhere near as fast as jdehesa's answer:
def do_it5():
    dfarray = df.values
    n = len(dfarray)  # number of rows (dfarray.size would count all elements)
    score_values = dfarray[:, 0]
    score_list = np.unique(score_values)
    correct_pred = dfarray[:, 1]
    result = []
    for threshold in score_list:
        mask = score_values < threshold
        kept = ~mask
        result.append((threshold,
                       np.count_nonzero(mask) / n,
                       np.count_nonzero(correct_pred[kept]) / np.count_nonzero(kept)))
    result = pd.DataFrame(result, columns=['threshold', 'percentage_filtered', 'percentage_correct_pred'])
Upvotes: 1
Reputation: 59731
I will assume that score may have repeated values, but if it does not it will also work (although it could be simpler). This is a way to get that result:
import pandas as pd
import numpy as np
df = pd.DataFrame([[0.1, 0], [0.2, 1], [0.3, 1], [0.4, 0]],
columns=['score', 'correct_pred'])
# Group by scores and count occurrences and number of correct predictions
df2 = (df.sort_values('score')
         .groupby('score')['correct_pred']
         .agg(['count', 'sum'])
         .reset_index())
# Percentage of values below each threshold
perc_filtered = df2['count'].shift(1).fillna(0).cumsum() / df2['count'].sum()
# Percentage of values above each threshold with correct prediction
perc_correct_pred = df2['sum'][::-1].cumsum()[::-1] / df2['count'][::-1].cumsum()[::-1]
# Assemble result
result = pd.concat([df2['score'], perc_filtered, perc_correct_pred], axis=1)
result.columns = ['threshold', 'percentage_filtered', 'percentage_correct_pred']
print(result)
#    threshold  percentage_filtered  percentage_correct_pred
# 0        0.1                 0.00                 0.500000
# 1        0.2                 0.25                 0.666667
# 2        0.3                 0.50                 0.500000
# 3        0.4                 0.75                 0.000000
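The `[::-1].cumsum()[::-1]` idiom used for perc_correct_pred computes a suffix (reverse) cumulative sum: element i becomes the sum of elements i through the end. A tiny illustration on the example's correct_pred column:

```python
import pandas as pd

# suffix cumulative sum: reverse, accumulate, reverse back
s = pd.Series([0, 1, 1, 0])
suffix = s[::-1].cumsum()[::-1]
print(suffix.tolist())  # [2, 2, 1, 0]
```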
Performance:
np.random.seed(123)
df = pd.DataFrame({'score': np.arange(0, 1, 0.0005),
                   'correct_pred': np.random.choice([1, 0], size=2000)})
print (df)
       score  correct_pred
0     0.0000             1
1     0.0005             0
2     0.0010             1
3     0.0015             1
4     0.0020             1
...      ...           ...
1995  0.9975             0
1996  0.9980             0
1997  0.9985             1
1998  0.9990             1
1999  0.9995             1

[2000 rows x 2 columns]
In [208]: %timeit do_it_jdehesa()
9.57 ms ± 317 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [209]: %timeit do_it()
5.83 s ± 181 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [210]: %timeit do_it1()
3.21 s ± 203 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [211]: %timeit do_it2()
92.5 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Upvotes: 3