Reputation: 23
I have implemented a Kfold-cross validation algorithm for study a machine learning problem and set the parameters of a SVM (I know about skilearn but I wanted to perform an algorithm myself). I created 5 folds and I used this to test the SVM parameter 'C' and 'tollerance'. I saved the results in a text file and then I have created a Dataframe with Pandas like this :
C tol FP TN SPE TP FN SEN
0 100 0.10000 19 261 0.9469 107 6 0.9321
1 100 0.10000 30 250 0.8319 94 19 0.8929
2 100 0.10000 28 252 0.8496 96 17 0.9000
3 100 0.10000 27 253 0.9735 110 3 0.9036
4 100 0.10000 26 254 0.9469 107 6 0.9071
5 100 0.05000 16 264 0.9381 106 7 0.9429
6 100 0.05000 22 258 0.8319 94 19 0.9214
7 100 0.05000 25 255 0.8761 99 14 0.9107
8 100 0.05000 21 259 0.9646 109 4 0.9250
9 100 0.05000 20 260 0.9823 111 2 0.9286
.......
400 1000000 0.00001 21 259 0.9558 108 5 0.9250
401 1000000 0.00001 20 260 0.8850 100 13 0.9286
402 1000000 0.00001 14 266 0.8584 97 16 0.9500
403 1000000 0.00001 17 263 0.9558 108 5 0.9393
404 1000000 0.00001 23 257 0.9735 110 3 0.9179
It has 405 rows. I need to calculate the mean of each a group of 5 elements in the columns 'SPE' and 'SEN' and than iterate the process all over the dataframe (for examples I need to calculate the mean of the elements of columns 'SPE' and 'SEN' in row 0:4, than in 5:9, than in 10:14 ... until rows 400:404). For each iteration I wanted to obtain a matrix with this values :
['C', 'tol' , 'mean of SPE', 'mean of SEN']
The matrix will have 405/5 = 81 rows and 4 columns.
So for the part of the Dataframe above I want a results link this:
[[100, 0.10000, 0.90976, 0.90714],
[100,0.05000, 0.91860, 0.92572]]
.....
[1000000,0.00001,0.91860, 0.92572]
I want to obtain this matrix because my goal is to obtain 2 plots using pyplot : one for the variable 'SPE' vs 'tol' and one for the variable 'SEN' vs 'tol' plotting different curves for each values of 'C' .
Thanks
Upvotes: 2
Views: 1448
Reputation: 862591
Use groupby
by arange
created by floor divison with agg
first
and mean
, reindex_axis
is for for change order of columns:
df = df.groupby(np.arange(len(df.index)) // 5) \
.agg({'C':'first', 'tol':'first', 'SPE':'mean','SEN':'mean'}) \
.reindex_axis(['C','tol','SPE','SEN'], axis=1) \
.rename(columns = {'SPE':'mean of SPE','SEN':'mean of SEN'})
print (df)
C tol mean of SPE mean of SEN
0 100 0.10000 0.90976 0.90714
1 100 0.05000 0.91860 0.92572
2 1000000 0.00001 0.92570 0.93216
For plotting is possible use pivot
+ plot
:
df1 = df.pivot(index='mean of SPE', columns='tol', values='C')
print (df1)
tol 0.00001 0.05000 0.10000
mean of SPE
0.90976 NaN NaN 100.0
0.91860 NaN 100.0 NaN
0.92570 1000000.0 NaN NaN
Or maybe:
df1 = df.pivot(index='C', columns='tol', values='mean of SPE')
print (df1)
tol 0.00001 0.05000 0.10000
C
100 NaN 0.9186 0.90976
1000000 0.9257 NaN NaN
df1.plot()
For numpy array add values
:
df = df.groupby(np.arange(len(df.index)) // 5) \
.agg({'C':'first', 'tol':'first', 'SPE':'mean','SEN':'mean'}) \
.reindex_axis(['C','tol','SPE','SEN'], axis=1) \
.values
print (df)
[[ 1.00000000e+02 1.00000000e-01 9.09760000e-01 9.07140000e-01]
[ 1.00000000e+02 5.00000000e-02 9.18600000e-01 9.25720000e-01]
[ 1.00000000e+06 1.00000000e-05 9.25700000e-01 9.32160000e-01]]
EDIT:
If tolerance value is unique per each 5 rows, solution for df
can be a bit different - groupby
by column tol
instead arange
:
df = df.groupby('tol', sort=False) \
.agg({'C':'first', 'SPE':'mean','SEN':'mean'}) \
.reset_index() \
.reindex_axis(['C','tol','SPE','SEN'], axis=1) \
.rename(columns = {'SPE':'mean of SPE','SEN':'mean of SEN'})
print (df)
C tol mean of SPE mean of SEN
0 100 0.10000 0.90976 0.90714
1 100 0.05000 0.91860 0.92572
2 1000000 0.00001 0.92570 0.93216
Upvotes: 1