maxutil
maxutil

Reputation: 195

Pandas - Filter Col with Object DType for a Couple Conditions

I have a column in a df with object dtype. I'm having some trouble filtering for just the ones with $xxxxxxx and CAD in the price field.

Price
$1,000,000
$2,000,000
$700,000
1,234,567 CAD
$111,111
3,000,000 EUR
Inquire
$500,000
Auction

I have tried this without success:

df = df[(df['Price'].str.contains('$')) | (df['Price'].str.contains('CAD'))]

If I want just the CAD, this works:

df = df[df['Price'].str.contains('CAD')

But, how do I get all values with just a $ and CAD? So remove 3 in my sample data above (EUR, Inquire, Auction).

Upvotes: 4

Views: 176

Answers (3)

Karn Kumar
Karn Kumar

Reputation: 8826

I see already we have answers by experts but just another approach for posterity sake.

>>> df[ df['Price'].str.startswith('$') | df['Price'].str.endswith('CAD') ]
           Price
0     $1,000,000
1     $2,000,000
2       $700,000
3  1,234,567 CAD
4       $111,111
7       $500,000

Upvotes: 1

jpp
jpp

Reputation: 164773

$ is a special character in regex, and pd.Series.str.contains enables regex by default. You can disable regex, use re.escape, or escape via \:

import re

# choose one of the below    
m1 = df['Price'].str.contains('$', regex=False)  # disable regex, most efficient
m1 = df['Price'].str.contains(re.escape('$'))    # escape via re.escape
m1 = df['Price'].str.contains('\$')              # escape via \

# turn off regex when not required for a performance boost
m2 = df['Price'].str.contains('CAD', regex=False)

print(df[m1 | m2])

           Price
0     $1,000,000
1     $2,000,000
2       $700,000
3  1,234,567 CAD
4       $111,111
7       $500,000

Most adaptable is to use regex with re.escape. For example:

L = ['$', 'CAD']
search_str = '|'.join(map(re.escape, L))
df = df[df['Price'].str.contains(search_str)]

Upvotes: 2

Scott Boston
Scott Boston

Reputation: 153500

Try using \ for escape character and | for or operations. pd.Series.str.contains where the pattern is default to use regex:

df[df['Price'].str.contains('\$|CAD')]

Output:

           Price
0     $1,000,000
1     $2,000,000
2       $700,000
3  1,234,567 CAD
4       $111,111
7       $500,000

And, if you also wanted to catch that 'EUR', use another |:

df[df['Price'].str.contains('\$|CAD|EUR')]

Upvotes: 2

Related Questions