user88999

Reputation: 133

str.extract starting from the back in pandas DataFrame

I have a DataFrame with thousands of rows and two columns like so:

                                          string       state
0      the best new york cheesecake rochester ny          ny
1      the best dallas bbq houston tx random str          tx
2   la jolla fish shop of san diego san diego ca          ca
3                                   nothing here          dc

For each state, I have a regular expression of all city names (in lower case) structured like (city1|city2|city3|...) where the order of the cities is arbitrary (but can be changed if needed). For example, the regular expression for the state of New York contains both 'new york' and 'rochester' (and likewise 'dallas' and 'houston' for Texas, and 'san diego' and 'la jolla' for California).

I want to find the last city that appears in each string (for rows 0, 1, 2, and 3 above, I'd want 'rochester', 'houston', 'san diego', and NaN (or whatever), respectively).

I started off with str.extract and was trying to think of things like reversing the string but have reached an impasse.

Thanks so much for any help!

Upvotes: 2

Views: 1579

Answers (2)

jezrael

Reputation: 863801

You can use str.findall, but it returns an empty list when nothing matches, so an apply step is needed to substitute a placeholder. Then select the last item of each list with .str[-1]:

cities = r"new york|dallas|rochester|houston|san diego"

print (df['string'].str.findall(cities)
                   .apply(lambda x: x if len(x) >= 1 else ['no match val'])
                   .str[-1])
0       rochester
1         houston
2       san diego
3    no match val
Name: string, dtype: object

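(The condition must be len(x) >= 1, not > 1; with > 1, rows that have exactly one match would be replaced by the placeholder.)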

Another solution is a bit of a hack: prepend the no-match string to each string with radd, and add it to cities as well, so that findall always returns at least one item:

a = 'no match val'
cities = r"new york|dallas|rochester|houston|san diego" + '|' + a

print (df['string'].radd(a).str.findall(cities).str[-1])
0       rochester
1         houston
2       san diego
3    no match val
Name: string, dtype: object
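
A related sketch, not part of the answer above: since the question started from str.extract, a greedy `.*` placed before the capture group should make the regex engine backtrack from the end of the string, so the group captures the last city instead of the first. The sample frame and the city list (including 'la jolla') are assumptions taken from the question, and the pattern assumes single-line strings because `.` does not match newlines by default.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "string": [
        "the best new york cheesecake rochester ny",
        "the best dallas bbq houston tx random str",
        "la jolla fish shop of san diego san diego ca",
        "nothing here",
    ],
    "state": ["ny", "tx", "ca", "dc"],
})

cities = r"new york|dallas|rochester|houston|san diego|la jolla"

# The greedy ".*" consumes as much as possible, then backtracks until the
# capture group matches, so the group holds the rightmost city.
pattern = rf".*({cities})"

# expand=False returns a Series; rows without a match become NaN.
last_city = df["string"].str.extract(pattern, expand=False)
print(last_city)
```

Rows without any city come back as NaN here, which avoids the placeholder string.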

Upvotes: 4

DYZ

Reputation: 57145

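import re

cities = r"new york|dallas|..."

def last_match(s):
    # findall returns every non-overlapping match, left to right
    found = re.findall(cities, s)
    # take the last match, or an empty string if there is none
    return found[-1] if found else ""

df['string'].apply(last_match)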
#0    rochester
#1      houston
#2    san diego
#3         
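
The snippets above use one combined pattern, while the question describes a separate regular expression per state. A minimal sketch of that variant follows, assuming a hypothetical `CITY_PATTERNS` dictionary and `last_city` helper (neither name comes from the thread) and only the handful of cities mentioned in the question:

```python
import re

# Hypothetical per-state patterns; the real lists would be much longer.
CITY_PATTERNS = {
    "ny": re.compile(r"new york|rochester"),
    "tx": re.compile(r"dallas|houston"),
    "ca": re.compile(r"san diego|la jolla"),
}

def last_city(text, state):
    """Return the last city of `state` found in `text`, or None."""
    pattern = CITY_PATTERNS.get(state)
    if pattern is None:
        # No pattern is known for this state.
        return None
    found = pattern.findall(text)
    return found[-1] if found else None

# Rows copied from the question as (string, state) pairs.
rows = [
    ("the best new york cheesecake rochester ny", "ny"),
    ("the best dallas bbq houston tx random str", "tx"),
    ("la jolla fish shop of san diego san diego ca", "ca"),
    ("nothing here", "dc"),
]
results = [last_city(text, state) for text, state in rows]
print(results)
```

On a DataFrame, the same helper could be applied row by row with df.apply(lambda r: last_city(r['string'], r['state']), axis=1), though that is likely slower than the vectorized string methods on large frames.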

Upvotes: 3
