MARCO LAGALLA

Reputation: 267

Pandas extract subset from dataframe

I have a pandas dataframe like the following:

 index  Validation_Set  Topics   Alpha       Beta  Coherence
 0      75% Corpus      14         0.5        0.5   0.501483
 1      75% Corpus      14         0.5  symmetric   0.481676
 2     100% Corpus      14  asymmetric        0.5   0.500620
 3     100% Corpus      14         0.5  symmetric   0.492288
 4      75% Corpus      12         0.5        0.5   0.511823
 5      75% Corpus      12         0.5  symmetric   0.477614
 6     100% Corpus      12  asymmetric        0.5   0.489424
 7     100% Corpus      12         0.5  symmetric   0.541270
 8      75% Corpus       4         0.5        0.5   0.515683
 9      75% Corpus       4         0.5  symmetric   0.430614
10     100% Corpus       4  asymmetric        0.5   0.489324
11     100% Corpus       4         0.5  symmetric   0.473570

And so on. These are the results of several parameter-tuning tests.

Now I want to extract all the information (every test on the parameters) for the best model only, that is, the one (or possibly more than one) that achieved the highest value of 'Coherence' on the full validation set (100% Corpus).

In this example I would get [ERROR, SEE EDIT]:

 index  Validation_Set  Topics   Alpha       Beta  Coherence
 7     100% Corpus      12         0.5  symmetric   0.541270

I managed to retrieve the row with the highest value for 'Coherence' in this way (df is the full dataframe):

corpus_100 = df[df['Validation_Set']=='100% Corpus']
topics_num = df.iloc[[corpus_100['Coherence'].idxmax()]]['Topics'].values[0]
opt_model = corpus_100[corpus_100['Topics']==topics_num]

This works, but it's really messy, so I'm looking for a clearer way to implement it.

Thank you!

EDIT: I'm really sorry, but there was a typo in the desired output, which should actually be:

 4      75% Corpus      12         0.5        0.5   0.511823
 5      75% Corpus      12         0.5  symmetric   0.477614
 6     100% Corpus      12  asymmetric        0.5   0.489424
 7     100% Corpus      12         0.5  symmetric   0.541270

Upvotes: 0

Views: 73

Answers (2)

merit_2

Reputation: 471

Try this:

df[df['Coherence']==df['Coherence'].max()]

df[df['column']==value] filters the dataframe down to the rows where 'column' equals value.

df['column'].max() returns the maximum value in 'column'.

Putting them together returns the row (or rows, if there is a tie) of the dataframe with the maximum value in Coherence.
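Note that the one-liner above takes the maximum over the whole dataframe and returns only the matching row(s), while the question's EDIT asks for every test of the best model. A minimal sketch of how the same mask-plus-max idea could be adapted, assuming the dataframe from the question is rebuilt by hand with a default integer index (the names full, best_topics and opt_model are just illustrative):

```python
import pandas as pd

# Rebuild the sample data from the question (assumption: default RangeIndex 0..11).
df = pd.DataFrame({
    "Validation_Set": ["75% Corpus", "75% Corpus", "100% Corpus", "100% Corpus"] * 3,
    "Topics": [14] * 4 + [12] * 4 + [4] * 4,
    "Alpha": ["0.5", "0.5", "asymmetric", "0.5"] * 3,
    "Beta": ["0.5", "symmetric", "0.5", "symmetric"] * 3,
    "Coherence": [0.501483, 0.481676, 0.500620, 0.492288,
                  0.511823, 0.477614, 0.489424, 0.541270,
                  0.515683, 0.430614, 0.489324, 0.473570],
})

# Restrict to the full validation set before looking for the maximum.
full = df[df["Validation_Set"] == "100% Corpus"]

# Topics value(s) of the row(s) that reach the maximum Coherence (ties are kept).
best_topics = full.loc[full["Coherence"] == full["Coherence"].max(), "Topics"]

# All tests (75% and 100% Corpus) for the best model(s).
opt_model = df[df["Topics"].isin(best_topics)]
print(opt_model)
```

Using isin rather than == keeps this working if several Topics values tie for the best Coherence.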

Upvotes: 1

G. Anderson

Reputation: 5955

Looks like nlargest() is exactly what you need:

df[df['Validation_Set']=='100% Corpus'].nlargest(1,'Coherence')

 index  Validation_Set  Topics  Alpha       Beta  Coherence
 7     100% Corpus      12    0.5  symmetric    0.54127
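This returns only the single best row. To get the output from the question's EDIT (all tests for that model), one possible follow-up is to feed the Topics value of that row back into a filter on the full dataframe. The sketch below assumes the question's data rebuilt by hand with a default integer index; keep="all" is an optional nlargest argument that retains ties, and the names best and opt_model are illustrative:

```python
import pandas as pd

# Sample data from the question (assumption: default RangeIndex 0..11).
df = pd.DataFrame({
    "Validation_Set": ["75% Corpus", "75% Corpus", "100% Corpus", "100% Corpus"] * 3,
    "Topics": [14] * 4 + [12] * 4 + [4] * 4,
    "Alpha": ["0.5", "0.5", "asymmetric", "0.5"] * 3,
    "Beta": ["0.5", "symmetric", "0.5", "symmetric"] * 3,
    "Coherence": [0.501483, 0.481676, 0.500620, 0.492288,
                  0.511823, 0.477614, 0.489424, 0.541270,
                  0.515683, 0.430614, 0.489324, 0.473570],
})

# Best row(s) on the full validation set; keep="all" retains ties for first place.
best = df[df["Validation_Set"] == "100% Corpus"].nlargest(1, "Coherence", keep="all")

# Every row of the original dataframe that shares the winning Topics value(s).
opt_model = df[df["Topics"].isin(best["Topics"])]
print(opt_model)
```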

Upvotes: 0
