A. Barnabo
A. Barnabo

Reputation: 23

Python pandas dataframe mean and plot values

I have implemented a Kfold-cross validation algorithm for study a machine learning problem and set the parameters of a SVM (I know about skilearn but I wanted to perform an algorithm myself). I created 5 folds and I used this to test the SVM parameter 'C' and 'tollerance'. I saved the results in a text file and then I have created a Dataframe with Pandas like this :

           C      tol  FP   TN     SPE   TP  FN     SEN
0        100  0.10000  19  261  0.9469  107   6  0.9321
1        100  0.10000  30  250  0.8319   94  19  0.8929
2        100  0.10000  28  252  0.8496   96  17  0.9000
3        100  0.10000  27  253  0.9735  110   3  0.9036
4        100  0.10000  26  254  0.9469  107   6  0.9071
5        100  0.05000  16  264  0.9381  106   7  0.9429
6        100  0.05000  22  258  0.8319   94  19  0.9214
7        100  0.05000  25  255  0.8761   99  14  0.9107
8        100  0.05000  21  259  0.9646  109   4  0.9250
9        100  0.05000  20  260  0.9823  111   2  0.9286

.......
400  1000000  0.00001  21  259  0.9558  108   5  0.9250
401  1000000  0.00001  20  260  0.8850  100  13  0.9286
402  1000000  0.00001  14  266  0.8584   97  16  0.9500
403  1000000  0.00001  17  263  0.9558  108   5  0.9393
404  1000000  0.00001  23  257  0.9735  110   3  0.9179

It has 405 rows. I need to calculate the mean of each a group of 5 elements in the columns 'SPE' and 'SEN' and than iterate the process all over the dataframe (for examples I need to calculate the mean of the elements of columns 'SPE' and 'SEN' in row 0:4, than in 5:9, than in 10:14 ... until rows 400:404). For each iteration I wanted to obtain a matrix with this values :

['C', 'tol' , 'mean of SPE', 'mean of SEN']

The matrix will have 405/5 = 81 rows and 4 columns.

So for the part of the Dataframe above I want a results link this:

[[100, 0.10000, 0.90976, 0.90714],
 [100,0.05000, 0.91860, 0.92572]]
.....
[1000000,0.00001,0.91860, 0.92572]

I want to obtain this matrix because my goal is to obtain 2 plots using pyplot : one for the variable 'SPE' vs 'tol' and one for the variable 'SEN' vs 'tol' plotting different curves for each values of 'C' .
Thanks

Upvotes: 2

Views: 1448

Answers (1)

jezrael
jezrael

Reputation: 862591

Use groupby by arange created by floor divison with agg first and mean, reindex_axis is for for change order of columns:

df = df.groupby(np.arange(len(df.index)) // 5) \
       .agg({'C':'first', 'tol':'first', 'SPE':'mean','SEN':'mean'}) \
       .reindex_axis(['C','tol','SPE','SEN'], axis=1) \
       .rename(columns = {'SPE':'mean of SPE','SEN':'mean of SEN'})
print (df)
         C      tol  mean of SPE  mean of SEN
0      100  0.10000      0.90976      0.90714
1      100  0.05000      0.91860      0.92572
2  1000000  0.00001      0.92570      0.93216

For plotting is possible use pivot + plot:

df1 = df.pivot(index='mean of SPE', columns='tol', values='C')
print (df1)
tol            0.00001  0.05000  0.10000
mean of SPE                             
0.90976            NaN      NaN    100.0
0.91860            NaN    100.0      NaN
0.92570      1000000.0      NaN      NaN

Or maybe:

df1 = df.pivot(index='C', columns='tol', values='mean of SPE')
print (df1)
tol      0.00001  0.05000  0.10000
C                                 
100          NaN   0.9186  0.90976
1000000   0.9257      NaN      NaN

df1.plot()

For numpy array add values:

df = df.groupby(np.arange(len(df.index)) // 5) \
       .agg({'C':'first', 'tol':'first', 'SPE':'mean','SEN':'mean'}) \
       .reindex_axis(['C','tol','SPE','SEN'], axis=1) \
       .values
print (df)
[[  1.00000000e+02   1.00000000e-01   9.09760000e-01   9.07140000e-01]
 [  1.00000000e+02   5.00000000e-02   9.18600000e-01   9.25720000e-01]
 [  1.00000000e+06   1.00000000e-05   9.25700000e-01   9.32160000e-01]]

EDIT:

If tolerance value is unique per each 5 rows, solution for df can be a bit different - groupby by column tol instead arange:

df = df.groupby('tol', sort=False) \
       .agg({'C':'first', 'SPE':'mean','SEN':'mean'}) \
       .reset_index() \
       .reindex_axis(['C','tol','SPE','SEN'], axis=1) \
       .rename(columns = {'SPE':'mean of SPE','SEN':'mean of SEN'})
print (df)
         C      tol  mean of SPE  mean of SEN
0      100  0.10000      0.90976      0.90714
1      100  0.05000      0.91860      0.92572
2  1000000  0.00001      0.92570      0.93216

Upvotes: 1

Related Questions