Ben
Ben

Reputation: 329

group-by values obtained from splitting indexes

I need to find the max of two columns (p_1_logreg, p_2_logreg) where the comparison should be limited only to 14 rows.

enter image description here

My csv file

I tried to slice my index into:

    int1_str1_str2_int2_str3_int4

The max should be found between rows where int1, str1, str2 int2 and str3 are fixed, and only the int4 would change (from index 0 to index 13, and so on).

I tried to fix each element at a time and use groupby, but I couldn't iterate over int4 value only.

Here is the code to find the max for column p_1_label, but the result is not what I am looking for.

    max_1_row=raw_prob.loc[raw_prob.groupby(raw_prob['id'].str.split('_').str[1])['p_1_'+label].idxmax()]

    max_1_row=max_1_row.loc[raw_prob.groupby(raw_prob['id'].str.split('_').str[3])['p_1_'+label].idxmax()]

    max_1_row=max_1_row.loc[raw_prob.groupby(raw_prob['id'].str.split('_').str[5])['p_1_'+label].idxmax()]

Any ideas?

Upvotes: 1

Views: 33

Answers (1)

jezrael
jezrael

Reputation: 862751

I think you need DataFrameGroupBy.idxmax by replaced last _ with empty string and then select by loc:

df = pd.read_csv('myProb.csv', index_col=[0])

idx = df.drop('id', 1).groupby(df['id'].str.replace('_\d+$', '')).idxmax()
print (idx.head(15))
                              p_0_logreg  p_1_logreg  p_2_logreg
id                                                              
6_PanaCleanerJune_sub_12_ICA           2           9           6
6_PanaCleanerJune_sub_13_ICA          17          19          23
6_PanaCleanerJune_sub_14_ICA          34          37          33
6_PanaCleanerJune_sub_15_ICA          52          51          43
6_PanaCleanerJune_sub_17_ICA          66          67          69
6_PanaCleanerJune_sub_18_ICA          82          79          76
6_PanaCleanerJune_sub_19_ICA          89          87          90
6_PanaCleanerJune_sub_20_ICA          98         103         104
6_PanaCleanerJune_sub_21_ICA         114         117         112
6_PanaCleanerJune_sub_22_ICA         129         133         127
6_PanaCleanerJune_sub_23_ICA         145         146         143
6_PanaCleanerJune_sub_24_ICA         155         166         161
6_PanaCleanerJune_sub_25_ICA         176         173         174
6_PanaCleanerJune_sub_26_ICA         186         191         189
6_PanaCleanerJune_sub_27_ICA         202         203         209

df1 = df.loc[idx['p_1_logreg']]
print (df1.head(15))
                                  id  p_0_logreg  p_1_logreg  p_2_logreg
9    6_PanaCleanerJune_sub_12_ICA_10    0.013452    0.985195    0.001353
19    6_PanaCleanerJune_sub_13_ICA_6    0.051184    0.948816    0.000000
37   6_PanaCleanerJune_sub_14_ICA_10    0.013758    0.979351    0.006890
51   6_PanaCleanerJune_sub_15_ICA_10    0.076056    0.923944    0.000000
67   6_PanaCleanerJune_sub_17_ICA_12    0.051060    0.947660    0.001280
79   6_PanaCleanerJune_sub_18_ICA_10    0.051184    0.948816    0.000000
87    6_PanaCleanerJune_sub_19_ICA_4    0.078162    0.917751    0.004087
103   6_PanaCleanerJune_sub_20_ICA_6    0.076400    0.921263    0.002337
117   6_PanaCleanerJune_sub_21_ICA_6    0.155002    0.791753    0.053245
133   6_PanaCleanerJune_sub_22_ICA_8    0.000000    0.998623    0.001377
146   6_PanaCleanerJune_sub_23_ICA_7    0.017549    0.973995    0.008457
166  6_PanaCleanerJune_sub_24_ICA_13    0.025215    0.974785    0.000000
173   6_PanaCleanerJune_sub_25_ICA_6    0.025656    0.960220    0.014124
191  6_PanaCleanerJune_sub_26_ICA_10    0.098872    0.895526    0.005602
203   6_PanaCleanerJune_sub_27_ICA_8    0.066493    0.932470    0.001037

df2 = df.loc[idx['p_2_logreg']]
print (df2.head(15))
                                  id  p_0_logreg  p_1_logreg  p_2_logreg
6     6_PanaCleanerJune_sub_12_ICA_7    0.000000    0.000351    0.999649
23   6_PanaCleanerJune_sub_13_ICA_10    0.000000    0.000351    0.999649
33    6_PanaCleanerJune_sub_14_ICA_6    0.080748    0.000352    0.918900
43    6_PanaCleanerJune_sub_15_ICA_2    0.017643    0.000360    0.981996
69   6_PanaCleanerJune_sub_17_ICA_14    0.882449    0.000290    0.117261
76    6_PanaCleanerJune_sub_18_ICA_7    0.010929    0.000360    0.988711
90    6_PanaCleanerJune_sub_19_ICA_7    0.010929    0.000351    0.988720
104   6_PanaCleanerJune_sub_20_ICA_7    0.006714    0.000360    0.992925
112   6_PanaCleanerJune_sub_21_ICA_1    0.869393    0.000339    0.130269
127   6_PanaCleanerJune_sub_22_ICA_2    0.000000    0.000351    0.999649
143   6_PanaCleanerJune_sub_23_ICA_4    0.017218    0.000360    0.982421
161   6_PanaCleanerJune_sub_24_ICA_8    0.369685    0.000712    0.629603
174   6_PanaCleanerJune_sub_25_ICA_7    0.307056    0.000496    0.692448
189   6_PanaCleanerJune_sub_26_ICA_8    0.850195    0.000368    0.149437
209  6_PanaCleanerJune_sub_27_ICA_14    0.000000    0.000351    0.999649

Detail:

print (df['id'].str.replace('_\d+$', '').head(15))
0     6_PanaCleanerJune_sub_12_ICA
1     6_PanaCleanerJune_sub_12_ICA
2     6_PanaCleanerJune_sub_12_ICA
3     6_PanaCleanerJune_sub_12_ICA
4     6_PanaCleanerJune_sub_12_ICA
5     6_PanaCleanerJune_sub_12_ICA
6     6_PanaCleanerJune_sub_12_ICA
7     6_PanaCleanerJune_sub_12_ICA
8     6_PanaCleanerJune_sub_12_ICA
9     6_PanaCleanerJune_sub_12_ICA
10    6_PanaCleanerJune_sub_12_ICA
11    6_PanaCleanerJune_sub_12_ICA
12    6_PanaCleanerJune_sub_12_ICA
13    6_PanaCleanerJune_sub_12_ICA
14    6_PanaCleanerJune_sub_13_ICA
Name: id, dtype: object

Upvotes: 2

Related Questions