Reputation: 133
I have a DataFrame with thousands of rows and two columns like so:
                                         string state
0     the best new york cheesecake rochester ny    ny
1     the best dallas bbq houston tx random str    tx
2  la jolla fish shop of san diego san diego ca    ca
3                                  nothing here    dc
For each state, I have a regular expression of all city names (in lower case), structured like (city1|city2|city3|...), where the order of the cities is arbitrary (but can be changed if needed). For example, the regular expression for the state of New York contains both 'new york' and 'rochester' (and likewise 'dallas' and 'houston' for Texas, and 'san diego' and 'la jolla' for California).
I want to find the last city that appears in each string (for rows 0, 1, 2, and 3, I'd want 'rochester', 'houston', 'san diego', and NaN (or whatever), respectively).
I started off with str.extract and tried things like reversing the string, but have reached an impasse.
Thanks so much for any help!
Upvotes: 2
Views: 1579
Reputation: 863801
You can use str.findall, but a row with no match gives an empty list, so apply is needed to substitute a placeholder list. Then select the last item of each list with .str[-1]:
cities = r"new york|dallas|rochester|houston|san diego"
print(df['string'].str.findall(cities)
                  .apply(lambda x: x if len(x) >= 1 else ['no match val'])
                  .str[-1])
0 rochester
1 houston
2 san diego
3 no match val
Name: string, dtype: object
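(The condition has to be len(x) >= 1, not > 1: with > 1, a row containing exactly one city would be replaced by the placeholder.)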
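If NaN is acceptable for rows without a match (the question says "NaN (or whatever)"), the apply step may not be needed at all, because .str[-1] on an empty list should give a missing value rather than raise. A minimal sketch, assuming the sample data below mirrors the question's DataFrame (the variable name last_city is only illustrative):

```python
import pandas as pd

# Sample data assumed to mirror the question's DataFrame
df = pd.DataFrame({
    "string": [
        "the best new york cheesecake rochester ny",
        "the best dallas bbq houston tx random str",
        "la jolla fish shop of san diego san diego ca",
        "nothing here",
    ],
    "state": ["ny", "tx", "ca", "dc"],
})

cities = r"new york|dallas|rochester|houston|san diego"

# findall returns one list per row; .str[-1] picks the last element,
# and an empty list yields a missing value instead of an error
last_city = df["string"].str.findall(cities).str[-1]
print(last_city)
```

Use the apply version above only when a specific placeholder string is wanted instead of NaN.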
Another solution is a bit of a hack: prepend the no-match string to each string with radd, and add the same string to cities as well, so every row is guaranteed at least one match:
a = 'no match val'
cities = r"new york|dallas|rochester|houston|san diego" + '|' + a
print (df['string'].radd(a).str.findall(cities).str[-1])
0 rochester
1 houston
2 san diego
3 no match val
Name: string, dtype: object
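Both snippets above use one combined pattern, while the question describes a separate regular expression per state. A rough sketch of how the per-state patterns could be applied, assuming they live in a dict keyed by the state column (city_patterns, last_city and the dict contents are hypothetical names and values made up for illustration):

```python
import re
import pandas as pd

df = pd.DataFrame({
    "string": [
        "the best new york cheesecake rochester ny",
        "the best dallas bbq houston tx random str",
        "la jolla fish shop of san diego san diego ca",
        "nothing here",
    ],
    "state": ["ny", "tx", "ca", "dc"],
})

# Hypothetical per-state patterns; the real ones would list every city
city_patterns = {
    "ny": r"new york|rochester",
    "tx": r"dallas|houston",
    "ca": r"san diego|la jolla",
}
compiled = {state: re.compile(p) for state, p in city_patterns.items()}

def last_city(text, state):
    """Return the last city of `state` found in `text`, or None."""
    pattern = compiled.get(state)
    if pattern is None:  # no pattern known for this state
        return None
    found = pattern.findall(text)
    return found[-1] if found else None

# Pair each string with its own state's pattern
df["last_city"] = [last_city(s, st) for s, st in zip(df["string"], df["state"])]
print(df["last_city"])
```

Compiling each pattern once up front avoids recompiling it for every row, which may matter with thousands of rows.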
Upvotes: 4
Reputation: 57145
import re

cities = r"new york|dallas|..."

def last_match(s):
    # findall returns matches left to right, so the last one is at index -1
    found = re.findall(cities, s)
    return found[-1] if found else ""

df['string'].apply(last_match)
#0 rochester
#1 houston
#2 san diego
#3
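Since the question started from str.extract, that route can probably be made to work too by putting a greedy .* in front of the capture group, so the group ends up on the last city in the string. This is only a sketch: it assumes single-line strings (. does not match a newline by default), and the names strings and last_city are illustrative:

```python
import pandas as pd

strings = pd.Series([
    "the best new york cheesecake rochester ny",
    "the best dallas bbq houston tx random str",
    "la jolla fish shop of san diego san diego ca",
    "nothing here",
])

cities = r"new york|dallas|rochester|houston|san diego"

# The greedy ".*" consumes as much as it can and backtracks only until
# the group matches, so the capture lands on the last city in the string
last_city = strings.str.extract(rf".*({cities})", expand=False)
print(last_city)
```

Rows without any city come back as NaN, so no placeholder handling is needed.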
Upvotes: 3