Xin

Reputation: 674

How to match strings and substrings from dictionary values

I have the following function to detect direction strings in my data. I joined the keys and the values of the dictionary because I want to match either the full name or the abbreviation. I added ^ and $ because I only want exact matches.

Function

import pandas as pd

def check_direction(df):
    # dict for all direction and their abbreviation
    direction = {
        '^Northwest$': '^NW$',
        '^Northeast$': '^NE$',
        '^Southeast$': '^SE$',
        '^Southwest$': '^SW$',
        '^North$': '^N$',
        '^East$': '^E$',
        "^South$": '^S$',
        "^West$": "^W$"}

    # combining all the dict pairs into one for str match
    all_direction = direction.keys() | direction.values()
    all_direction = '|'.join(all_direction)

    df = df.astype(str)
    df = pd.DataFrame(df.str.contains(all_direction, case = False))

    return df

I ran tests on the following series, which worked as intended:

tmp = pd.Series(['Monday', 'Tuesday', 'Wednesday', 'Thursday'])
check_direction(tmp)

0   False
1   False
2   False
3   False

tmp = pd.Series(['SOUTH', 'NORTHEAST', 'WEST'])
check_direction(tmp)

0   True
1   True
2   True

However, I ran into problems here:

tmp = pd.Series(['32 Street NE', 'Ogden Road SE'])
check_direction(tmp)

0   False
1   False

Both returned False when they should be True because of NE and SE. How can I modify my code to make that happen?

Upvotes: 0

Views: 25

Answers (1)

mkrieger1

Reputation: 23140

I think you misunderstood what ^ and $ mean.

  • ^ matches the beginning of the whole string,
  • $ matches the end of the whole string.

For example, 'Ogden Road SE' does not match the pattern ^SE$, because the string does not begin with SE. With both anchors, the pattern only matches when the whole string is exactly SE.

You probably meant to use word boundaries, which are written \b.

So you should change ^SE$ to \bSE\b, and so on.
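To see the difference between the anchors and word boundaries, here is a small sketch using only the standard re module. The variable names and the 'Sesame Street' sample are just for illustration and are not from the question.

```python
import re

address = 'Ogden Road SE'

# ^SE$ requires the whole string to be exactly 'SE', so this finds nothing.
anchored = re.search(r'^SE$', address)

# \bSE\b only requires 'SE' to stand alone as a word somewhere in the string.
bounded = re.search(r'\bSE\b', address)

# Word boundaries still reject 'Se' at the start of a longer word.
inside_word = re.search(r'\bSE\b', 'Sesame Street', flags=re.IGNORECASE)

print(anchored)         # None
print(bounded.group())  # SE
print(inside_word)      # None
```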

You can make this less tedious and more readable by keeping the dictionary free of regex syntax and adding the word boundaries when you build the pattern:

direction = {
    'Northwest': 'NW',
    'Northeast': 'NE',
    'Southeast': 'SE',
    'Southwest': 'SW',
    'North': 'N',
    'East': 'E',
    'South': 'S',
    'West': 'W'}

all_direction = direction.keys() | direction.values()
all_direction = '|'.join(r'\b{}\b'.format(d) for d in all_direction)
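Here is a minimal sketch of how the resulting pattern behaves, checked with plain re on a list of strings so that it runs without pandas. The helper has_direction and the sample list are made up for illustration. Sorting the names and calling re.escape are optional precautions I added, not something the pattern strictly needs. The same pattern string should be usable with str.contains(pattern, case=False) as in the question.

```python
import re

direction = {
    'Northwest': 'NW',
    'Northeast': 'NE',
    'Southeast': 'SE',
    'Southwest': 'SW',
    'North': 'N',
    'East': 'E',
    'South': 'S',
    'West': 'W'}

# keys() | values() gives an unordered set; sorting makes the pattern reproducible.
all_direction = sorted(direction.keys() | direction.values())

# Wrap every name in word boundaries, then join the alternatives with '|'.
pattern = '|'.join(r'\b{}\b'.format(re.escape(d)) for d in all_direction)
regex = re.compile(pattern, flags=re.IGNORECASE)


def has_direction(text):
    """Hypothetical helper: True if any direction appears as a whole word."""
    return regex.search(text) is not None


samples = ['32 Street NE', 'Ogden Road SE', 'SOUTH', 'Monday', 'Wednesday']
results = [has_direction(s) for s in samples]
print(results)  # [True, True, True, False, False]
```

Keep in mind that the single-letter abbreviations match any standalone letter, so a string like 'E Street' would also count as containing a direction.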

Upvotes: 1
