TheFirstLoop

Reputation: 31

How to use regex in Python list

I need your help: I want to apply a regex to each item of a list and keep only the part of the string that follows my keyword.

my_list looks like:

 ['Paris, 458 boulevard Saint-Germain', 'Marseille, 29 rue Camille Desmoulins', 'Marseille, 1 chemin des Aubagnens']

The regex:

re.compile(ur'(?<=rue|boulevard|quai|chemin).*', re.MULTILINE)

Expected list after processing:

['Saint-Germain', 'Camille Desmoulins', 'des Aubagnens']

Thanks for your help.

Upvotes: 2

Views: 4589

Answers (2)

Kamehameha

Reputation: 5473

Lookbehinds must have a fixed width in Python's re module, so a lookbehind over alternatives of different lengths is rejected.
Try this simpler regex instead -

>>> import re
>>> regex = re.compile(r'(?:rue|boulevard|quai|chemin)(.*)', re.MULTILINE)
>>> [re.findall(regex, e)[0].strip() for e in my_list]
['Saint-Germain', 'Camille Desmoulins', 'des Aubagnens']

EDIT:
Using a separate lookbehind for each keyword also works, because each one has a fixed width on its own -

(?:(?<=rue)|(?<=boulevard)|(?<=quai)|(?<=chemin))(.*)

Example -

>>> regex = re.compile(r'(?:(?<=rue)|(?<=boulevard)|(?<=quai)|(?<=chemin))(.*)', re.MULTILINE)
>>> [re.findall(regex, e)[0].strip() for e in my_list]
['Saint-Germain', 'Camille Desmoulins', 'des Aubagnens']
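Note that indexing with [0] raises an IndexError for any item that contains none of the keywords. A minimal sketch of a more defensive variant, assuming you would rather get None for such items; the helper name after_keyword and the 'Lyon' address are made up for illustration and are not part of the answer above:

```python
import re

# Same keyword list as above; \s* swallows the space after the keyword.
KEYWORDS = re.compile(r'(?:rue|boulevard|quai|chemin)\s*(.*)')

def after_keyword(address):
    """Return the text after the first keyword, or None if there is none."""
    match = KEYWORDS.search(address)
    return match.group(1).strip() if match else None

my_list = ['Paris, 458 boulevard Saint-Germain',
           'Marseille, 29 rue Camille Desmoulins',
           'Marseille, 1 chemin des Aubagnens']

names = [after_keyword(a) for a in my_list]
print(names)  # ['Saint-Germain', 'Camille Desmoulins', 'des Aubagnens']

# An address without any keyword yields None instead of an exception.
print(after_keyword('Lyon, 3 place Bellecour'))  # None
```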

Upvotes: 2

Wiktor Stribiżew

Reputation: 626748

Your regex does not compile in Python: it throws the error look-behind requires fixed-width pattern.
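To see this for yourself, here is a small sketch that compiles the pattern from the question and captures the failure. It assumes a CPython version whose re module still rejects variable-width lookbehinds, which is the case for every release I am aware of:

```python
import re

# The alternatives have different lengths (3, 9, 4 and 6 characters),
# so the lookbehind has no single fixed width.
try:
    re.compile(r'(?<=rue|boulevard|quai|chemin).*')
except re.error as exc:
    error_message = str(exc)
else:
    error_message = None  # the pattern compiled after all

print(error_message)
```

The error is raised at compile time, before any string from the list is examined.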

Also, please note that the re.MULTILINE flag in your regex is redundant: it only changes how ^ and $ behave, and the pattern contains neither.
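A quick illustration of what the flag does and does not affect. The two-line sample string is invented for the demo and is not from the question:

```python
import re

text = 'rue A\nquai B'  # made-up two-line input

# No ^ or $ in the pattern: the flag changes nothing.
plain = re.findall(r'(?:rue|quai) (.*)', text)
multi = re.findall(r'(?:rue|quai) (.*)', text, re.MULTILINE)
print(plain, multi)  # ['A', 'B'] ['A', 'B']

# With ^ in the pattern, the flag lets it match at the start of every line.
anchored_plain = re.findall(r'^(?:rue|quai) (.*)', text)
anchored_multi = re.findall(r'^(?:rue|quai) (.*)', text, re.MULTILINE)
print(anchored_plain, anchored_multi)  # ['A'] ['A', 'B']
```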

Here is the code you can use:

import re
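lst = ['Paris, 458 boulevard Saint-Germain', 'Marseille, 29 rue Camille Desmoulins', 'Marseille, 1 chemin des Aubagnens']
# Match everything up to and including the last keyword
p = re.compile(r'.*(?:rue|boulevard|quai|chemin)')
# Remove the matched prefix and trim the leftover whitespace
print([p.sub('', x).strip() for x in lst])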


Result:

['Saint-Germain', 'Camille Desmoulins', 'des Aubagnens']

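The r'.*(?:rue|boulevard|quai|chemin)' regex matches

  • .* - zero or more characters other than a newline, as many as possible, so the match runs up to the last keyword in the string
  • (?:rue|boulevard|quai|chemin) - one of the alternatives separated by |, in a non-capturing group.

and then the matched text is removed with re.sub, leaving only what follows the keyword.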

NOTE: you can force whole-word matching with the \b word boundary, so that chemin is matched but chemins is not:

r'.*\b(?:rue|boulevard|quai|chemin)\b'
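Here is a short sketch of the difference the word boundaries make. The address is invented: "Marquais" is a hypothetical street name chosen only because it happens to contain "quai":

```python
import re

loose = re.compile(r'.*(?:rue|boulevard|quai|chemin)')
strict = re.compile(r'.*\b(?:rue|boulevard|quai|chemin)\b')

address = 'Lyon, 5 rue du Marquais'  # hypothetical address

# Without \b, the greedy .* runs up to the "quai" inside "Marquais".
loose_result = loose.sub('', address).strip()
print(loose_result)   # s

# With \b, only the standalone word "rue" qualifies as a keyword.
strict_result = strict.sub('', address).strip()
print(strict_result)  # du Marquais
```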

Upvotes: 2
