Reputation: 4163
So, I have this string str = u'world-weather-online®_jkpahjicmehopmlkbenbkmckcedlcmhk' in Python and I just want to extract the world-weather-online® part of it using a regular expression. What I did first is match = re.search(r'([a-zA-Z0-9\-\%\+]+?)_[a-z]+', str) and then I get the result in a string with str2 = match.group(1).
However, I end up with the error 'NoneType' object has no attribute 'group'. If I just try it with the string "world-weather-online_jkpahjicmehopmlkbenbkmckcedlcmhk", it works just fine, so the special Unicode symbol is what creates the problem. I tried using match = re.search(ur'([a-zA-Z0-9\-\%\+]+?)_[a-z]+', str) but it still doesn't help. Any ideas on how to solve this one? Thanks!
Upvotes: 0
Views: 95
Reputation: 1121266
Use a Unicode regular expression and include the codepoint in your pattern:
match = re.search(ur'([a-zA-Z0-9®%+-]+?)_[a-z]+', yourstr)
You may want to think about what other codepoints should be included, apart from the registered trademark sign ® (U+00AE).
Demo:
>>> import re
>>> yourstr = u'world-weather-online®_jkpahjicmehopmlkbenbkmckcedlcmhk'
>>> print re.search(ur'([a-zA-Z0-9®%+-]+?)_[a-z]+', yourstr).group(1)
world-weather-online®
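If listing every allowed codepoint gets unwieldy, one possible alternative (a sketch, not part of the answer above) is to match anything that is not an underscore and to check for a failed match before calling .group(). The helper name extract_name is hypothetical, and the pattern assumes the input contains exactly one underscore followed by a lowercase ID. The \u00ae escape stands for ® so the snippet does not depend on the source file encoding.

```python
import re

def extract_name(text):
    """Return the part before the '_<lowercase id>' suffix, or None."""
    # [^_]+ accepts any character except the underscore, so symbols such as
    # the registered sign (U+00AE) do not have to be listed one by one.
    match = re.search(u'^([^_]+)_[a-z]+$', text)
    # re.search returns None when nothing matches; guard before .group().
    return match.group(1) if match else None

name = extract_name(u'world-weather-online\u00ae_jkpahjicmehopmlkbenbkmckcedlcmhk')
missing = extract_name(u'no-underscore-here')
```

Here name holds the text before the underscore and missing is None, which avoids the 'NoneType' object has no attribute 'group' error from the question.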
Upvotes: 3
Reputation: 16556
Well, I think that you only forgot the ® in your regexp:
>>> match = re.search(ur'([a-zA-Z0-9\-%+®]+?)_[a-z]+', str)
>>> match.group(1)
u'world-weather-online\xae'
But if your string contains more Unicode characters, listing them all makes the regexp long. In that case re.search(r'(.*)_[a-z]+', str) can do the trick.
And if you just want to split on the '_':
>>> str.split('_')[0]
u'world-weather-online\xae'
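One thing to keep in mind: the split and the greedy (.*) regexp only agree when there is a single underscore. A small sketch with a made-up input that has two underscores (the sample string is an assumption, not from the question) shows the difference:

```python
import re

# Hypothetical input with an extra underscore inside the name part.
text = u'world_weather-online\u00ae_jkpahjicmehopmlkbenbkmckcedlcmhk'

first_split = text.split(u'_')[0]      # cuts at the first underscore
last_split = text.rsplit(u'_', 1)[0]   # cuts at the last underscore
# Greedy .* grabs as much as possible, then backs off to the last underscore.
greedy = re.search(u'(.*)_[a-z]+', text).group(1)
```

So if the name itself may contain underscores, rsplit('_', 1) is probably the closer match to what the regexp does.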
Upvotes: 2