Reputation: 31
I need your help: I want to use a regex on a list to get only the part of each string that comes after my keyword.
my_list looks like:
['Paris, 458 boulevard Saint-Germain', 'Marseille, 29 rue Camille Desmoulins', 'Marseille, 1 chemin des Aubagnens']
The regex:
re.compile(ur'(?<=rue|boulevard|quai|chemin).*', re.MULTILINE)
Expected list after processing:
['Saint-Germain', 'Camille Desmoulins', 'des Aubagnens']
Thanks for your help.
Upvotes: 2
Views: 4589
Reputation: 5473
Lookbehinds must have a fixed width in Python's re module.
Try this simpler regex instead -
>>> regex = re.compile(ur'(?:rue|boulevard|quai|chemin)(.*)', re.MULTILINE)
>>> [re.findall(regex, e)[0].strip() for e in my_list]
['Saint-Germain', 'Camille Desmoulins', 'des Aubagnens']
EDIT:
Using a separate lookbehind for each keyword also works -
(?:(?<=rue)|(?<=boulevard)|(?<=quai)|(?<=chemin))(.*)
Example -
>>> regex = re.compile(ur'(?:(?<=rue)|(?<=boulevard)|(?<=quai)|(?<=chemin))(.*)', re.MULTILINE)
>>> [re.findall(regex, e)[0].strip() for e in my_list]
['Saint-Germain', 'Camille Desmoulins', 'des Aubagnens']
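The fixed-width restriction can be checked directly. The following is a minimal sketch, assuming Python 3 (so plain r'' strings instead of ur''); the helper name compiles is made up for illustration and is not part of the re module.

```python
import re

def compiles(pattern):
    """Return True if re.compile accepts `pattern`, False if it raises re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# A single lookbehind whose alternatives have different lengths (3 to 9 characters).
variable_width = r'(?<=rue|boulevard|quai|chemin).*'

# One lookbehind per keyword, so each lookbehind has a fixed width on its own.
split_lookbehinds = r'(?:(?<=rue)|(?<=boulevard)|(?<=quai)|(?<=chemin))(.*)'

print(compiles(variable_width))
print(compiles(split_lookbehinds))
```

The first pattern should be rejected and the second accepted, which is why splitting the alternation into separate lookbehinds gets around the error.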
Upvotes: 2
Reputation: 626748
It seems your regex does not work in Python; the error it throws is "look-behind requires fixed-width pattern".
Also, please note that the re.MULTILINE flag in your regex is redundant, as there is no ^ or $ in the pattern whose behavior it would redefine.
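To illustrate the point about re.MULTILINE, here is a small sketch, assuming Python 3. The two-line sample string and the variable names are made up for the demonstration; the first comparison uses the capturing pattern from this thread, the second uses a hypothetical pattern with ^ to show where the flag does matter.

```python
import re

# Hypothetical two-line input built from the addresses in the question.
text = 'Paris, 458 boulevard Saint-Germain\nMarseille, 29 rue Camille Desmoulins'

# No ^ or $ in the pattern: the flag changes nothing.
plain = re.compile(r'(?:rue|boulevard|quai|chemin)(.*)')
multi = re.compile(r'(?:rue|boulevard|quai|chemin)(.*)', re.MULTILINE)
same_without_anchors = plain.findall(text) == multi.findall(text)

# With ^ in the pattern, the flag lets ^ match at the start of every line.
starts_plain = re.findall(r'^\w+', text)
starts_multi = re.findall(r'^\w+', text, re.MULTILINE)

print(same_without_anchors)
print(starts_plain)
print(starts_multi)
```

Without anchors both compiled patterns return the same matches, while the ^ pattern finds only the first line's leading word unless the flag is set.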
Here is the code you can use:
import re
lst = ['Paris, 458 boulevard Saint-Germain', 'Marseille, 29 rue Camille Desmoulins', 'Marseille, 1 chemin des Aubagnens']
p = re.compile(r'.*(?:rue|boulevard|quai|chemin)')
print([p.sub('', x).strip() for x in lst])
Result:
['Saint-Germain', 'Camille Desmoulins', 'des Aubagnens']
The r'.*(?:rue|boulevard|quai|chemin)' regex matches:
.* - zero or more characters (any character except a newline), as many as possible
(?:rue|boulevard|quai|chemin) - one of the alternatives delimited with |
The matched text is then removed with re.sub.
NOTE: you can force whole-word matching with the \b word boundary, so that chemin is matched but not chemins:
r'.*\b(?:rue|boulevard|quai|chemin)\b'
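A short sketch of the difference the word boundary makes, assuming Python 3. The sample address is invented for the example (it is not from the question) and was chosen because it contains "chemins" after the real keyword.

```python
import re

loose = re.compile(r'.*(?:rue|boulevard|quai|chemin)')
strict = re.compile(r'.*\b(?:rue|boulevard|quai|chemin)\b')

# Hypothetical address: "chemins" contains the keyword "chemin" as a substring.
address = 'Lyon, 7 rue des chemins Verts'

# Without \b, the greedy .* runs up to the "chemin" inside "chemins".
loose_result = loose.sub('', address).strip()

# With \b, only the standalone word "rue" counts as a keyword.
strict_result = strict.sub('', address).strip()

print(loose_result)
print(strict_result)
```

The pattern without \b cuts the string in the middle of "chemins", whereas the \b version keeps everything after the standalone keyword "rue".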
Upvotes: 2