Reputation: 3993
This is a really tricky statistics that I want to produce. My dataframe
contains information about true classes and prediction results of a machine learning model, for trips and corresponding trips' segments. The problem can best be explained with example, so I give the following example df
:
df = pd.DataFrame(
{'trip': [25, 25, 25, 25, 25, 25, 25, 25, 25, 54, 54, 54, 54,73,73,73,75,75],
'segment': [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2,0,0,1,1,3],
'class': [3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1,2,2,2,1,1],
'prediction': [0, 0, 3, 3, 3, 4, 4, 2, 2, 0, 0, 1, 1,4,2,4,0,2]
}
)
df
trip segment class prediction
0 25 0 3 0
1 25 0 3 0
2 25 0 3 3
3 25 0 3 3
4 25 0 3 3
5 25 1 3 4
6 25 1 3 4
7 25 1 3 2
8 25 1 3 2
9 54 2 1 0
10 54 2 1 0
11 54 2 1 1
12 54 2 1 1
13 73 0 2 4
14 73 0 2 2
15 73 1 2 4
16 75 1 1 0
17 75 3 1 2
From the given df
, I would like to produce statistics of model's predictions at trip
and segment
levels, using prediction
's majority votes, considering the actual class
a trip
or segment
belongs to.
Segment's statistics
So considering the above df
, I would like to produce the below segment's statistics (explanation given below):
class total-segments correctly-predicted accuracy-rate
0 - - -
1 3 1 0.33
2 2 1 0.5
3 2 1 0.5
4 - - -
class
0
, so the dash.class
type 1
(segment 2
of trip 54
and segments 1
& 3
of trip 75
). Of all the 3, only one (segment 2
of trip 54
has majority votes of its prediction
correct, so 1
correctly-predicted
and 0.33
(i.e. 1/3
) accuracy-rate.class
type 2
( segments 0
& 1
of trip 73
). Segment 0
has majority votes correct, so 1
correctly-predicted
and 0.5
(i.e. 1/2
) accuracy-rate.class
3
(segments 0
& 1
of trip 25
). Segment 0
has majority votes correct, so 1
correctly-predicted
and 0.5
(i.e. 1/5
) accuracy-rate.class
type 4
.Trip-level statistics
Similarly, considering the class
type of distinct trips in df
and their prediction
, I want to produce the following trip-level statistics (also explained below):
class total-trips correctly-predicted accuracy-rate
0 - - -
1 2 1 0.5
2 1 0 0.0
3 1 1 1.0
4 - - -
class
0
.class
type 1
(trip 54
& 75
). 1 trip was predicted correct (majority votes of trip 54
), so 1
correctly-predicted
trip, and 0.5
accuracy-rate
.class
2
(trip 73
). Its majority votes prediction is incorrect, so 0
correctly-predicted
trip, and 0.0
accuracy-rate
.class
3
(trip 25
). Its majority votes prediction is correct (3
), so 1
correctly-predicted
trip, and 1.0
accuracy-rate
.class 4
.Please forgive the long grammar, but this is a problem that one can understand only when well-explained.
Upvotes: 1
Views: 47
Reputation: 29635
You can do it this way. you can comment all but the first line and then uncomment one by one to see what is happening with the command line.
res_seg = (
df['class'].eq(df['prediction'])
.groupby([df['class'],df['segment']]).mean()
.ge(0.5)
.groupby(level='class').agg(['size','sum'])
.rename(columns={'size':'total_segments','sum':'correctly_predicted'})\
.assign(accuracy_rate = lambda x: x['correctly_predicted']/x['total_segments'])
.reindex(range(5), fill_value='-')
.reset_index()
)
print(res_seg)
# class total_segments correctly_predicted accuracy_rate
# 0 0 - - -
# 1 1 3 1 0.333333
# 2 2 2 1 0.5
# 3 3 2 1 0.5
# 4 4 - - -
and similar for the trips, you would have to change the df['segment']
to df['trip']
in the groupby and maybe the name of the columns in the rename
as well as the assign
Upvotes: 1