Reputation: 1320
I am using pandas 0.16.2 on Python 2.7, OS X.
I read a DataFrame from a CSV file like this:
import pandas as pd
data = pd.read_csv("my_csv_file.csv",sep='\t', skiprows=(0), header=(0))
The output of data.dtypes is:
name object
weight float64
ethnicity object
dtype: object
I was expecting string types for name and ethnicity, but I found explanations here on SO of why they are "object" in newer pandas versions.
Now, I want to select rows based on ethnicity, for example:
data[data['ethnicity']=='Asian']
Out[3]:
Empty DataFrame
Columns: [name, weight, ethnicity]
Index: []
I get the same result with data[data.ethnicity=='Asian'] or data[data['ethnicity']=="Asian"].
But when I try the following:
data[data['ethnicity'].str.contains('Asian')].head(3)
I get the results I want.
However, I do not want to use "contains"; I would like to check for direct equality.
Please note that data[data['ethnicity'].str=='Asian'] raises an error.
Am I doing something wrong? How do I do this correctly?
Upvotes: 23
Views: 53120
Reputation: 879739
There is probably leading or trailing whitespace in your strings, which would explain why str.contains matches but == does not. For example,
data = pd.DataFrame({'ethnicity':[' Asian', ' Asian']})
data.loc[data['ethnicity'].str.contains('Asian'), 'ethnicity'].tolist()
# [' Asian', ' Asian']
print(data[data['ethnicity'].str.contains('Asian')])
yields
ethnicity
0 Asian
1 Asian
To strip the leading or trailing whitespace off the strings, you could use
data['ethnicity'] = data['ethnicity'].str.strip()
after which,
data.loc[data['ethnicity'] == 'Asian']
yields
ethnicity
0 Asian
1 Asian
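To confirm that whitespace is really the cause before stripping, it can help to look at the repr of the distinct values, since padding is invisible in the normal DataFrame printout. Below is a minimal sketch; the sample rows are made up for illustration and are not the asker's file:

```python
import pandas as pd

# Hypothetical sample data: some values carry stray whitespace
data = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'ethnicity': [' Asian', 'Asian ', 'Other'],
})

# Exact comparison finds nothing because of the padding
exact_before = data[data['ethnicity'] == 'Asian']

# repr() makes the leading/trailing spaces visible
raw_values = [repr(v) for v in data['ethnicity'].unique()]
print(raw_values)

# Strip the padding, then exact comparison works
data['ethnicity'] = data['ethnicity'].str.strip()
exact_after = data[data['ethnicity'] == 'Asian']
print(exact_after)
```

If the repr output shows something other than plain spaces (for example a non-breaking space), str.strip() with no arguments may not be enough and the offending characters would have to be passed explicitly.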
Upvotes: 25
Reputation: 23548
You might try this:
data[data['ethnicity'].str.strip()=='Asian']
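Unlike assigning the stripped values back, this form only strips while building the boolean mask, so the stored column keeps its original padding. A small sketch with made-up rows (including a missing value, which is an assumption about what the real file might contain):

```python
import pandas as pd

# Hypothetical sample data with padding and one missing value
data = pd.DataFrame({'ethnicity': [' Asian', 'Asian', None, 'Caucasian ']})

# Strip only for the comparison; missing values compare as False
mask = data['ethnicity'].str.strip() == 'Asian'
selected = data[mask]

print(mask.tolist())
print(selected['ethnicity'].tolist())  # original, unstripped values
```

If later code also compares against this column, stripping once and assigning the result back is probably the less error-prone choice.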
Upvotes: 6