TheRookierLearner

Reputation: 4163

How do I process a regular expression having unicode in Python?

I have the string str = u'world-weather-online®_jkpahjicmehopmlkbenbkmckcedlcmhk' in Python, and I want to extract just the world-weather-online® part of it using a regular expression. I first run match = re.search(r'([a-zA-Z0-9\-\%\+]+?)_[a-z]+', str) and then store the result in a string with str2 = match.group(1).

However, I end up with the error 'NoneType' object has no attribute 'group'. If I try it with the string "world-weather-online_jkpahjicmehopmlkbenbkmckcedlcmhk", it works fine, so the special Unicode symbol seems to be the problem. I also tried match = re.search(ur'([a-zA-Z0-9\-\%\+]+?)_[a-z]+', str), but that doesn't help either. Any ideas on how to solve this? Thanks!

Upvotes: 0

Views: 95

Answers (2)

Martijn Pieters

Reputation: 1121266

Use a Unicode regular expression and include the codepoint in your pattern:

match = re.search(ur'([a-zA-Z0-9®%+-]+?)_[a-z]+', yourstr)

You may want to think about what other codepoints should be included, apart from the registered-trademark ® codepoint (U+00AE).

Demo:

>>> import re
>>> yourstr = u'world-weather-online®_jkpahjicmehopmlkbenbkmckcedlcmhk'
>>> print re.search(ur'([a-zA-Z0-9®%+-]+?)_[a-z]+', yourstr).group(1)
world-weather-online®
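
On Python 3 the ur'' prefix no longer exists and every str is already Unicode, so a plain raw string is enough. A minimal sketch of the same pattern, assuming Python 3; the helper name extract_name is hypothetical, and it returns None instead of raising when nothing matches:

```python
import re

# Same character class as above, written as a plain raw string (Python 3).
PATTERN = re.compile(r'([a-zA-Z0-9®%+-]+?)_[a-z]+')


def extract_name(text):
    """Return the part before '_<lowercase letters>', or None if no match."""
    match = PATTERN.search(text)
    # re.search returns None on failure, which is what caused the
    # "'NoneType' object has no attribute 'group'" error in the question.
    return match.group(1) if match else None


print(extract_name('world-weather-online®_jkpahjicmehopmlkbenbkmckcedlcmhk'))
print(extract_name('no underscore here'))
```

Checking the match object before calling group avoids the AttributeError whenever an input contains a character the class does not cover.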

Upvotes: 3

fredtantini

Reputation: 16556

Well, I think that you only forgot the ® in your regexp:

>>> match = re.search(ur'([a-zA-Z0-9\-\%\+®]+?)_[a-z]+', str)
>>> match.group(1)
u'world-weather-online\xae'

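But if your string can contain other Unicode characters, the character class can get long. In that case, just re.search(r'(.*)_[a-z]+', str) can do the trick. Note that .* is greedy, so it captures everything up to the last underscore that is followed by lowercase letters.

And if you just want to split on the '_':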
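
To illustrate how the greedy .* behaves, here is a small sketch, assuming Python 3 and a made-up input that contains more than one underscore. The negated class [^_]+ is an alternative that is not part of the answer above; it stops at the first underscore instead:

```python
import re

# Hypothetical input with two underscores, for illustration only.
text = 'world-weather-online®_extra_jkpahjic'

# Greedy: .* backtracks from the end, so it stops at the LAST '_[a-z]+'.
greedy = re.search(r'(.*)_[a-z]+', text).group(1)

# Negated class: matches a run of non-underscore characters,
# so it stops at the FIRST underscore.
first = re.search(r'([^_]+)_[a-z]+', text).group(1)

print(greedy)  # world-weather-online®_extra
print(first)   # world-weather-online®
```

Which of the two is right depends on whether the name part itself may contain underscores.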

>>> str.split('_')[0]
u'world-weather-online\xae'
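
If the input might not contain an underscore at all, str.partition is a related option. A short sketch, assuming Python 3 (the variable names are illustrative). Unlike the regular expression, neither split nor partition checks that lowercase letters follow the underscore:

```python
text = 'world-weather-online®_jkpahjicmehopmlkbenbkmckcedlcmhk'

# partition splits on the first '_' and returns (head, separator, tail).
head, sep, tail = text.partition('_')
print(head)  # world-weather-online®

# With no underscore, sep is '' and head is the whole string,
# so the two cases can be told apart.
head2, sep2, tail2 = 'no-underscore'.partition('_')
print(repr(sep2))  # ''
```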

Upvotes: 2
