Reputation: 195
I have a column in a df with object dtype. I'm having some trouble filtering for just the ones with $xxxxxxx and CAD in the price field.
Price
$1,000,000
$2,000,000
$700,000
1,234,567 CAD
$111,111
3,000,000 EUR
Inquire
$500,000
Auction
I have tried this without success:
df = df[(df['Price'].str.contains('$')) | (df['Price'].str.contains('CAD'))]
If I want just the CAD, this works:
df = df[df['Price'].str.contains('CAD')
But, how do I get all values with just a $ and CAD? So remove 3 in my sample data above (EUR, Inquire, Auction).
Upvotes: 4
Views: 176
Reputation: 8826
I see already we have answers by experts but just another approach for posterity sake.
>>> df[ df['Price'].str.startswith('$') | df['Price'].str.endswith('CAD') ]
Price
0 $1,000,000
1 $2,000,000
2 $700,000
3 1,234,567 CAD
4 $111,111
7 $500,000
Upvotes: 1
Reputation: 164773
$
is a special character in regex, and pd.Series.str.contains
enables regex by default. You can disable regex, use re.escape
, or escape via \
:
import re
# choose one of the below
m1 = df['Price'].str.contains('$', regex=False) # disable regex, most efficient
m1 = df['Price'].str.contains(re.escape('$')) # escape via re.escape
m1 = df['Price'].str.contains('\$') # escape via \
# turn off regex when not required for a performance boost
m2 = df['Price'].str.contains('CAD', regex=False)
print(df[m1 | m2])
Price
0 $1,000,000
1 $2,000,000
2 $700,000
3 1,234,567 CAD
4 $111,111
7 $500,000
Most adaptable is to use regex with re.escape
. For example:
L = ['$', 'CAD']
search_str = '|'.join(map(re.escape, L))
df = df[df['Price'].str.contains(search_str)]
Upvotes: 2
Reputation: 153500
Try using \
for escape character and |
for or operations. pd.Series.str.contains where the pattern is default to use regex:
df[df['Price'].str.contains('\$|CAD')]
Output:
Price
0 $1,000,000
1 $2,000,000
2 $700,000
3 1,234,567 CAD
4 $111,111
7 $500,000
And, if you also wanted to catch that 'EUR', use another |
:
df[df['Price'].str.contains('\$|CAD|EUR')]
Upvotes: 2