vpk

Reputation: 1320

Pandas: cannot filter based on string equality

I'm using pandas 0.16.2 on Python 2.7, OS X.

I read a DataFrame from a CSV file like this:

import pandas as pd

data = pd.read_csv("my_csv_file.csv", sep='\t', skiprows=0, header=0)

The output of data.dtypes is:

name       object
weight     float64
ethnicity  object
dtype: object

I was expecting string types for name and ethnicity, but I found explanations here on SO of why they are "object" in newer pandas versions.

Now, I want to select rows based on ethnicity, for example:

data[data['ethnicity']=='Asian']
Out[3]: 
Empty DataFrame
Columns: [name, weight, ethnicity]
Index: []

I get the same result with data[data.ethnicity=='Asian'] or data[data['ethnicity']=="Asian"].

But when I try the following:

data[data['ethnicity'].str.contains('Asian')].head(3)

I get the results I want.

However, I do not want to use "contains"; I would like to check for exact equality.

Please note that data[data['ethnicity'].str=='Asian'] raises an error.

Am I doing something wrong? How do I do this correctly?

Upvotes: 23

Views: 53120

Answers (2)

unutbu

Reputation: 879739

There is probably leading or trailing whitespace in your strings. For example,

data = pd.DataFrame({'ethnicity':[' Asian', '  Asian']})
data.loc[data['ethnicity'].str.contains('Asian'), 'ethnicity'].tolist()
# [' Asian', '  Asian']
print(data[data['ethnicity'].str.contains('Asian')])

yields

  ethnicity
0     Asian
1     Asian

To strip the leading or trailing whitespace off the strings, you could use

data['ethnicity'] = data['ethnicity'].str.strip()

after which,

data.loc[data['ethnicity'] == 'Asian']

yields

  ethnicity
0     Asian
1     Asian
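
If you want to confirm that stray whitespace really is the cause before changing anything, something like the following may help. It is only a sketch: the DataFrame here is made-up stand-in data (not the asker's CSV), and `raw_values` and `n_padded` are names chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the column read from the CSV file
data = pd.DataFrame({'ethnicity': [' Asian', 'Asian ', 'Asian']})

# repr() wraps each value in quotes, so padding becomes visible
raw_values = [repr(v) for v in data['ethnicity'].unique()]
print(raw_values)

# Count the entries that change when whitespace is stripped
stripped = data['ethnicity'].str.strip()
n_padded = int((data['ethnicity'] != stripped).sum())
print(n_padded)
```

A non-zero count suggests that an exact `==` comparison will miss those rows until the column is stripped.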

Upvotes: 25

Daniel Martin

Reputation: 23548

You might try this:

data[data['ethnicity'].str.strip()=='Asian']
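
As a rough illustration of how this differs from assigning the stripped values back: the comparison is made on a temporary stripped copy, so the stored column keeps its original padding. The sample data and the names `mask` and `matches` below are assumptions for the sake of the example.

```python
import pandas as pd

# Hypothetical sample rows; the real data would come from the CSV file
data = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'ethnicity': [' Asian', 'Asian ', 'Other'],
})

# Build a boolean mask from the stripped values, then filter with it
mask = data['ethnicity'].str.strip() == 'Asian'
matches = data[mask]
print(matches)

# The original column is left untouched, padding included
print(data['ethnicity'].tolist())
```

This is handy for a one-off filter; if you compare against the column repeatedly, stripping it once up front is probably simpler.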

Upvotes: 6
