devin

Reputation: 6527

Extract unicode substrings with the re module

I have a string like this:

s = u'something extra BEGIN the unicode text I want with an é END some more extra stuff'

I want this text:

result = 'the unicode text I want with an é'

I've tried to use this code:

expr = r'(?<=BEGIN)[\sa-zA-Z]+(?=END)'
result = re.search(expr, s)
result = re.sub(r'(^\s+)|(\s+$)', '', result)  # just to strip out leading/trailing white space

But as long as the é is in the string s, re.search always returns None.

Note, I've tried using different combinations of .* instead of [\sa-zA-Z]+ without success.

Upvotes: 0

Views: 124

Answers (1)

user2555451

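The character ranges a-z and A-Z match only ASCII letters, so é falls outside your character class and the match can never reach END. Use . instead, which matches any character except a newline, including non-ASCII ones: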

>>> import re
>>> s = u'something extra BEGIN the unicode text I want with an é END some more extra stuff'
>>> print re.search(r'BEGIN(.+?)END', s).group(1)
 the unicode text I want with an é
>>>

Note too that I simplified your pattern a bit. Here is what it does:

BEGIN  # Matches BEGIN
(.+?)  # Captures one or more characters non-greedily
END    # Matches END
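
If you would rather keep a restrictive character class than switch to ., one option is \w, which covers non-ASCII letters. This is a minimal sketch assuming Python 3, where \w is Unicode-aware for str patterns by default; on Python 2 you would need a u'' string and the re.UNICODE flag. The variable names are only illustrative.

```python
import re

s = 'something extra BEGIN the unicode text I want with an é END some more extra stuff'

# Original ASCII-only class: é is not in a-zA-Z, so no match is possible.
ascii_only = re.search(r'(?<=BEGIN)[\sa-zA-Z]+(?=END)', s)

# \w matches Unicode word characters (letters, digits, underscore) in Python 3.
unicode_aware = re.search(r'(?<=BEGIN)[\s\w]+(?=END)', s)

print(ascii_only)                                         # no match object
print(unicode_aware.group() if unicode_aware else None)   # text between the markers, spaces included
```

Unlike ., this class still rejects punctuation between the markers, which may or may not be what you want.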

Also, you do not need a regex to remove whitespace from the ends of a string. Just use str.strip:

>>> ' a '.strip()
'a'
>>>
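
Putting the two pieces together, here is a rough sketch in Python 3 syntax. The helper name extract_between and its parameters are hypothetical, not part of the re module; it also guards against re.search returning None when the markers are missing.

```python
import re

def extract_between(text, start='BEGIN', end='END'):
    """Return the stripped text between start and end, or None if not found."""
    # re.escape keeps the markers literal even if they contain regex metacharacters.
    pattern = re.escape(start) + r'(.+?)' + re.escape(end)
    match = re.search(pattern, text)
    if match is None:
        return None
    # str.strip removes the leading/trailing whitespace; no second regex needed.
    return match.group(1).strip()

s = 'something extra BEGIN the unicode text I want with an é END some more extra stuff'
result = extract_between(s)
print(result)
```

Note that . does not match newlines by default, so text spanning several lines would need the re.DOTALL flag.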

Upvotes: 3
