Mr.Robot

Reputation: 349

Select rows of dataframe based on column values

Problem

I am working on a machine learning project that aims to find out which kinds of raw text data the classifiers tend to misclassify, and on which kinds they reach no consensus.

I have a dataframe with the labels, the predictions of two classifiers, and the text data. Is there a simple way to select rows based on set operations over the label and prediction columns?

The data might look like this:

   score                                             review     svm_pred  dnn_pred
0      0  I went and saw this movie last night after bei...            0         1
1      1  Actor turned director Bill Paxton follows up h...            1         1
2      1  As a recreational golfer with some knowledge o...            0         1
3      1  I saw this film in a sneak preview, and it is ...            1         1
4      1  Bill Paxton has taken the true story of the 19...            1         1
5      1  I saw this film on September 1st, 2005 in Indi...            1         1
6      1  Maybe I'm reading into this too much, but I wo...            0         1
7      1  I felt this film did have many good qualities....            1         1
8      1  This movie is amazing because the fact that th...            1         1
9      0  "Quitting" may be as much about exiting a pre-...            1         1


For example, if I want to select the rows where both classifiers make a mistake, index 9 should be returned.

A made-up MWE is provided here:

import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 2, 30).reshape(10, 3), columns=["score", "svm_pred", "dnn_pred"])

which returns

   score  svm_pred  dnn_pred
0      0         1         0
1      0         0         1
2      0         0         0
3      1         0         0
4      0         0         1
5      0         1         1
6      1         0         1
7      0         1         1
8      1         1         1
9      1         1         1

What I Have Done

I know I could list all possible combinations, 000, 001, etc. However,

Could someone help me? Thank you in advance.

Why This Question is Not a Duplicate

The existing answers only consider the case where the number of columns is limited. In my application, however, the number of classifier predictions (i.e. columns) could be large, which makes the existing answers hard to apply.

In addition, this appears to be the first use of the pd.Series.ne function for this particular application, which might help people with a similar problem.

Upvotes: 1

Views: 2444

Answers (2)

Chris Adams

Reputation: 18647

Create a helper Series holding the number of incorrect classifiers per row, which you can then do logical operations on. This assumes that the true score is in the first column and the predictions are in the second column onwards; you may need to update the slicing indices accordingly.

s = df.iloc[:, 1:].ne(df.iloc[:, 0], axis=0).sum(1)
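To see what the one-liner does, it may help to split it into steps. The following is only a sketch of one possible breakdown: the names truth, preds and wrong are introduced here for illustration, and the frame is the made-up example from the question written out literally.

```python
import pandas as pd

# The made-up example frame from the question, written out literally.
df = pd.DataFrame({
    "score":    [0, 0, 0, 1, 0, 0, 1, 0, 1, 1],
    "svm_pred": [1, 0, 0, 0, 0, 1, 0, 1, 1, 1],
    "dnn_pred": [0, 1, 0, 0, 1, 1, 1, 1, 1, 1],
})

truth = df.iloc[:, 0]    # true labels (first column)
preds = df.iloc[:, 1:]   # every remaining column is treated as a prediction

# Compare each prediction column with the labels, aligning on the row index.
# The result is a boolean frame: True where a classifier got the row wrong.
wrong = preds.ne(truth, axis=0)

# Summing the booleans across columns gives the number of wrong classifiers per row.
s = wrong.sum(axis=1)
print(s.tolist())
```

Passing axis=0 matters here: it tells ne to match the labels against the row index rather than against the column names.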

Example Usage:

import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 2, 30).reshape(10, 3),
                  columns=["score", "svm_pred", "dnn_pred"])

s = df.iloc[:, 1:].ne(df.iloc[:, 0], axis=0).sum(1)

# Return rows where all classifiers got it right
df[s.eq(0)]

   score  svm_pred  dnn_pred
2      0         0         0
8      1         1         1
9      1         1         1

# Return rows where exactly 1 classifier got it wrong
df[s.eq(1)]

   score  svm_pred  dnn_pred
0      0         1         0
1      0         0         1
4      0         0         1
6      1         0         1

# Return rows where all (here: both) classifiers got it wrong
df[s.eq(2)]

   score  svm_pred  dnn_pred
3      1         0         0
5      0         1         1
7      0         1         1
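With more than two classifiers, the hard-coded 2 in s.eq(2) would need to change. A minimal sketch of one way to avoid that, assuming every prediction column name contains "_pred" (an assumption based on the example data, not something the question guarantees), is shown below. It also covers the "no consensus" case from the question by counting the distinct predictions in each row.

```python
import pandas as pd

# The made-up example frame from the question, written out literally.
df = pd.DataFrame({
    "score":    [0, 0, 0, 1, 0, 0, 1, 0, 1, 1],
    "svm_pred": [1, 0, 0, 0, 0, 1, 0, 1, 1, 1],
    "dnn_pred": [0, 1, 0, 0, 1, 1, 1, 1, 1, 1],
})

# Assumption: prediction columns can be picked out by the "_pred" substring.
preds = df.filter(like="_pred")

# Number of wrong classifiers per row.
n_wrong = preds.ne(df["score"], axis=0).sum(axis=1)

# Rows where every classifier is wrong, whatever the number of classifiers.
all_wrong = df[n_wrong.eq(preds.shape[1])]

# Rows where the classifiers disagree with each other (more than one distinct prediction).
no_consensus = df[preds.nunique(axis=1).gt(1)]

print(all_wrong.index.tolist())
print(no_consensus.index.tolist())
```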

Upvotes: 1

ABot

Reputation: 197

You can combine boolean conditions, which act like set operations on the selected rows:

# returns indexes of the rows where score equals both the svm and the dnn prediction
df[(df['score'] == df['svm_pred']) & (df['score'] == df['dnn_pred'])].index

# returns indexes of the rows where both predictions are wrong
df[(df['score'] != df['svm_pred']) & (df['score'] != df['dnn_pred'])].index

# returns indexes of the rows where at least one prediction is wrong
df[(df['score'] != df['svm_pred']) | (df['score'] != df['dnn_pred'])].index

If you want the complete rows rather than just the index, omit the trailing .index:

# returns rows where at least one prediction is wrong
df[(df['score'] != df['svm_pred']) | (df['score'] != df['dnn_pred'])]
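Writing out one condition per classifier becomes tedious as the number of prediction columns grows. As a rough sketch, the & and | chains can be replaced by all and any over a boolean frame. The list pred_cols below is a hypothetical helper introduced for illustration, and the frame is the made-up example from the question written out literally.

```python
import pandas as pd

# The made-up example frame from the question, written out literally.
df = pd.DataFrame({
    "score":    [0, 0, 0, 1, 0, 0, 1, 0, 1, 1],
    "svm_pred": [1, 0, 0, 0, 0, 1, 0, 1, 1, 1],
    "dnn_pred": [0, 1, 0, 0, 1, 1, 1, 1, 1, 1],
})

# Hypothetical list of prediction columns; extend it as classifiers are added.
pred_cols = ["svm_pred", "dnn_pred"]

# True where a prediction differs from the label.
wrong = df[pred_cols].ne(df["score"], axis=0)

every_wrong = df[wrong.all(axis=1)]    # replaces chaining the != conditions with &
any_wrong = df[wrong.any(axis=1)]      # replaces chaining the != conditions with |
all_right = df[~wrong.any(axis=1)]     # replaces chaining the == conditions with &

print(every_wrong.index.tolist())
print(any_wrong.index.tolist())
print(all_right.index.tolist())
```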

Upvotes: 1
