Reputation: 674
I have the following function to detect direction strings in my data. I joined both the keys and the values of the dictionary since I want to match either form. I added ^ and $ because I only want exact matches.
Function
import pandas as pd
def check_direction(df):
    # dict for all direction and their abbreviation
    direction = {
        '^Northwest$': '^NW$',
        '^Northeast$': '^NE$',
        '^Southeast$': '^SE$',
        '^Southwest$': '^SW$',
        '^North$': '^N$',
        '^East$': '^E$',
        "^South$": '^S$',
        "^West$": "^W$"}
    # combining all the dict pairs into one for str match
    all_direction = direction.keys() | direction.values()
    all_direction = '|'.join(all_direction)
    df = df.astype(str)
    df = pd.DataFrame(df.str.contains(all_direction, case = False))
    return df
I ran tests on the following series which worked as intended:
tmp = pd.Series(['Monday', 'Tuesday', 'Wednesday', 'Thursday'])
check_direction(tmp)
0 False
1 False
2 False
3 False
tmp = pd.Series(['SOUTH', 'NORTHEAST', 'WEST'])
check_direction(tmp)
0 True
1 True
2 True
However, I ran into problems here:
tmp = pd.Series(['32 Street NE', 'Ogden Road SE'])
check_direction(tmp)
0 False
1 False
Both return False when they should be True because of NE and SE. How can I modify my code to make that happen?
Upvotes: 0
Views: 25
Reputation: 23140
I think you misunderstood what ^ and $ mean. ^ matches the beginning of the whole string, and $ matches the end of the whole string. For example, 'Ogden Road SE' does not match the pattern ^SE$, because the string does not begin with SE.
You probably meant to use word boundaries, which are written \b. So you should change ^SE$ to \bSE\b, and so on.
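To see the difference between the anchors and the word boundary, here is a small sketch using the standard re module. The variable names and the 'Sesame Street' string are only illustrative and are not from the question.

```python
import re

address = 'Ogden Road SE'

# ^SE$ only matches when the whole string is exactly "SE".
anchored = re.search(r'^SE$', address)

# \bSE\b matches "SE" as a standalone word anywhere in the string.
bounded = re.search(r'\bSE\b', address)

# A word boundary still rejects "SE" embedded in a longer word
# ('Sesame Street' is an illustrative input, not from the question).
embedded = re.search(r'\bSE\b', 'Sesame Street', flags=re.IGNORECASE)

print(anchored is None)  # True
print(bounded.group())   # SE
print(embedded is None)  # True
```

The raw-string prefix r'...' matters here: without it, Python would read \b as a backspace character instead of passing the word-boundary escape to the regex engine.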
You can make this less tedious and more readable by writing:
direction = {
    'Northwest': 'NW',
    'Northeast': 'NE',
    'Southeast': 'SE',
    'Southwest': 'SW',
    'North': 'N',
    'East': 'E',
    'South': 'S',
    'West': 'W'}
all_direction = direction.keys() | direction.values()
all_direction = '|'.join(r'\b{}\b'.format(d) for d in all_direction)
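Put together, the whole function might look something like the sketch below. It returns the boolean Series directly instead of wrapping it in a DataFrame, renames the parameter to series, and adds re.escape as a precaution; those three choices are my assumptions, not part of the original code.

```python
import re

import pandas as pd


def check_direction(series):
    # Plain names and abbreviations; the boundaries are added below.
    direction = {
        'Northwest': 'NW',
        'Northeast': 'NE',
        'Southeast': 'SE',
        'Southwest': 'SW',
        'North': 'N',
        'East': 'E',
        'South': 'S',
        'West': 'W'}
    all_direction = direction.keys() | direction.values()
    # Wrap every alternative in \b so it only matches as a whole word.
    # re.escape is a precaution (an addition here) in case a term ever
    # contains a regex metacharacter.
    pattern = '|'.join(r'\b{}\b'.format(re.escape(d)) for d in all_direction)
    return series.astype(str).str.contains(pattern, case=False)


tmp = pd.Series(['32 Street NE', 'Ogden Road SE', 'Monday', 'SOUTH'])
result = check_direction(tmp).tolist()
print(result)  # [True, True, False, True]
```

One thing to keep in mind: the single-letter abbreviations now match any standalone N, E, S or W, so a value such as 'Unit E' would also come back True.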
Upvotes: 1