arilwan
arilwan

Reputation: 3993

How to summarise dataframe by way of majority votes of a column

This is a really tricky statistics that I want to produce. My dataframe contains information about true classes and prediction results of a machine learning model, for trips and corresponding trips' segments. The problem can best be explained with example, so I give the following example df:

df = pd.DataFrame(
    {'trip': [25, 25, 25, 25, 25, 25, 25, 25, 25, 54, 54, 54, 54,73,73,73,75,75], 
     'segment': [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2,0,0,1,1,3], 
     'class': [3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1,2,2,2,1,1], 
     'prediction': [0, 0, 3, 3, 3, 4, 4, 2, 2, 0, 0, 1, 1,4,2,4,0,2]
     }
)

df
   trip segment class prediction
0   25    0      3      0
1   25    0      3      0
2   25    0      3      3
3   25    0      3      3
4   25    0      3      3
5   25    1      3      4
6   25    1      3      4
7   25    1      3      2
8   25    1      3      2
9   54    2      1      0
10  54    2      1      0
11  54    2      1      1
12  54    2      1      1
13  73    0      2      4
14  73    0      2      2
15  73    1      2      4
16  75    1      1      0
17  75    3      1      2

From the given df, I would like to produce statistics of model's predictions at trip and segment levels, using prediction's majority votes, considering the actual class a trip or segment belongs to.

Segment's statistics

So considering the above df, I would like to produce the below segment's statistics (explanation given below):

class total-segments  correctly-predicted  accuracy-rate
  0       -                  -                  -
  1       3                  1                 0.33
  2       2                  1                 0.5
  3       2                  1                 0.5
  4       -                  -                  -

Trip-level statistics

Similarly, considering the class type of distinct trips in df and their prediction, I want to produce the following trip-level statistics (also explained below):

class total-trips  correctly-predicted  accuracy-rate
  0       -                  -                  -
  1       2                  1                 0.5
  2       1                  0                 0.0
  3       1                  1                 1.0
  4       -                  -                  - 

Please forgive the long grammar, but this is a problem that one can understand only when well-explained.

Upvotes: 1

Views: 47

Answers (1)

Ben.T
Ben.T

Reputation: 29635

You can do it this way. you can comment all but the first line and then uncomment one by one to see what is happening with the command line.

res_seg = (
    df['class'].eq(df['prediction'])
      .groupby([df['class'],df['segment']]).mean()
      .ge(0.5)
      .groupby(level='class').agg(['size','sum'])
      .rename(columns={'size':'total_segments','sum':'correctly_predicted'})\
      .assign(accuracy_rate = lambda x: x['correctly_predicted']/x['total_segments'])
      .reindex(range(5), fill_value='-')
      .reset_index()
)
print(res_seg)
#    class total_segments correctly_predicted accuracy_rate
# 0      0              -                   -             -
# 1      1              3                   1      0.333333
# 2      2              2                   1           0.5
# 3      3              2                   1           0.5
# 4      4              -                   -             -

and similar for the trips, you would have to change the df['segment'] to df['trip'] in the groupby and maybe the name of the columns in the rename as well as the assign

Upvotes: 1

Related Questions