I have a string like this:
s = u'something extra BEGIN the unicode text I want with an é END some more extra stuff'
I want this text:
result = 'the unicode text I want with an é'
I've tried to use this code:
expr = r'(?<=BEGIN)[\sa-zA-Z]+(?=END)'
result = re.search(expr, s)
result = re.sub(r'(^\s+)|(\s+$)', '', result) # just to strip out leading/trailing white space
But as long as the é is in the string s, re.search always returns None.
Note, I've tried using different combinations of .* instead of [\sa-zA-Z]+ without success.
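The character ranges a-z and A-Z only match ASCII letters, so é falls outside your character class and the match fails. You can use . instead, which matches any character except a newline, including non-ASCII ones: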
>>> import re
>>> s = u'something extra BEGIN the unicode text I want with an é END some more extra stuff'
>>> print re.search(r'BEGIN(.+?)END', s).group(1)
the unicode text I want with an é
>>>
Note too that I simplified your pattern a bit. Here is what it does:
BEGIN # Matches BEGIN
(.+?) # Captures one or more characters non-greedily
END # Matches END
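If you would rather keep the original lookaround pattern and stay restricted to letters and whitespace, one possible variant is to swap a-zA-Z for \w. This is only a sketch, written for Python 3: \w also matches digits and underscores, and the re.UNICODE flag is redundant for str patterns there (it is passed explicitly on the assumption that the same pattern may be run on Python 2, where \w is ASCII-only without it).

```python
import re

s = u'something extra BEGIN the unicode text I want with an é END some more extra stuff'

# Same lookbehind/lookahead as in the question, but with \w in place of
# a-zA-Z so that non-ASCII letters such as é are accepted.
expr = r'(?<=BEGIN)[\s\w]+(?=END)'

match = re.search(expr, s, flags=re.UNICODE)

# re.search returns a match object (or None), so take the matched text
# with group(0) before stripping the surrounding whitespace.
result = match.group(0).strip() if match else None
print(result)
```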
Also, you do not need Regex to remove whitespace from the ends of a string. Just use str.strip:
>>> ' a '.strip()
'a'
>>>
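Putting the two points together, here is a minimal sketch in Python 3 syntax. The helper name extract_between and its parameters are made up for illustration and are not part of the re module. It also checks for None before calling group, since re.search returns None when the markers are missing.

```python
import re


def extract_between(text, start="BEGIN", end="END"):
    """Return the stripped text between start and end, or None if not found.

    Hypothetical helper for illustration only.
    """
    # re.escape keeps the markers literal even if they contain regex
    # metacharacters; (.+?) captures non-greedily up to the first end marker.
    pattern = re.escape(start) + r"(.+?)" + re.escape(end)
    match = re.search(pattern, text)
    if match is None:
        return None
    # str.strip removes the leading/trailing whitespace, no second regex needed.
    return match.group(1).strip()


s = 'something extra BEGIN the unicode text I want with an é END some more extra stuff'
print(extract_between(s))
```

Note that . does not match newlines by default, so if the wanted text can span several lines you would need to pass re.DOTALL to re.search.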