Extract unicode substrings with the re module

Question

I have a string like this:

s = u'something extra BEGIN the unicode text I want with an é END some more extra stuff'

I want this text:

result = 'the unicode text I want with an é'

I've tried to use this code:

expr = r'(?<=BEGIN)[\sa-zA-Z]+(?=END)'
result = re.search(expr, s)
result = re.sub(r'(^\s+)|(\s+$)', '', result)  # just to strip out leading/trailing white space

But as long as the é is in the string s, re.search always returns None.

Note, I've tried using different combinations of .* instead of [\sa-zA-Z]+ without success.

user2555451 · Accepted Answer

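The character ranges a-z and A-Z only match ASCII letters, so the é falls outside your character class and the whole match fails. You can use . instead, which matches any character except a newline, including non-ASCII ones: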

>>> import re
>>> s = u'something extra BEGIN the unicode text I want with an é END some more extra stuff'
>>> print(re.search(r'BEGIN(.+?)END', s).group(1))
 the unicode text I want with an é
>>>

Note too that I simplified your pattern a bit by replacing the lookarounds with a capture group. Here is what it does:

BEGIN  # Matches BEGIN
(.+?)  # Captures one or more characters non-greedily
END    # Matches END
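
To see why the non-greedy quantifier matters, here is a small sketch using a made-up input (not from the question) that contains END twice. The lazy .+? stops at the first END, while a greedy .+ runs on to the last one:

```python
import re

# Hypothetical input with two END markers, used only for illustration.
s = u'x BEGIN first END middle END y'

# Non-greedy: captures as little as possible, so it stops at the first END.
lazy = re.search(r'BEGIN(.+?)END', s).group(1)

# Greedy: captures as much as possible, so it stops at the last END.
greedy = re.search(r'BEGIN(.+)END', s).group(1)

print(repr(lazy))    # ' first '
print(repr(greedy))  # ' first END middle '
```

Keep in mind that . does not match newlines unless you pass re.DOTALL, so this only works as written when the text between the markers is on a single line.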

Also, you do not need a regex to remove whitespace from the ends of a string. Just use str.strip:

>>> ' a '.strip()
'a'
>>>
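
If you would rather keep the stricter character class and the lookarounds from the question, one possible alternative is to swap a-zA-Z for \w and pass re.UNICODE, so that \w also matches letters such as é. The sketch below assumes that approach; the helper name extract_between is made up for this example. Note that \w also accepts digits and underscores, which a-zA-Z did not.

```python
# -*- coding: utf-8 -*-
import re

def extract_between(text, start=u'BEGIN', end=u'END'):
    """Hypothetical helper: return the stripped text between start and end,
    or None when the markers are not found."""
    # re.escape keeps the markers literal; the lookbehind stays fixed-width.
    pattern = u'(?<=%s)[\\s\\w]+(?=%s)' % (re.escape(start), re.escape(end))
    # re.UNICODE makes \w and \s Unicode-aware on Python 2
    # (on Python 3 it is already the default for str patterns).
    match = re.search(pattern, text, re.UNICODE)
    if match is None:
        return None
    # str.strip replaces the re.sub call from the question.
    return match.group().strip()

s = u'something extra BEGIN the unicode text I want with an é END some more extra stuff'
result = extract_between(s)
print(result)
```

Checking for None before calling .group() avoids an AttributeError when the markers are missing from the input.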
