Reputation: 3257
I am trying to use a regular expression on HTML that contains unicode characters with diacritics:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
htmlstring=u'''/">čćđš</a>.../">España</a>'''
print re.findall( r'/">(.*?)</a', htmlstring, re.U )
This produces:
[u'\u010d\u0107\u0111\u0161', u'Espa\xf1a']
Any help, please?
Upvotes: 1
Views: 202
Reputation: 11591
This is a display question rather than a regex problem: your code is working as it should. Were you expecting something different? Strings prefixed with u are unicode literals. Inside them, \u followed by four hex digits and \x followed by two hex digits are escape sequences, each standing for a single unicode character. Printing a list shows the repr of each element, which is why you see the escapes. If you print each result itself (instead of looking at its repr), you will see the output you were presumably looking for:
results = [u'\u010d\u0107\u0111\u0161', u'Espa\xf1a']
for result in results:
    print result
čćđš
España
In the list you printed, you are seeing the repr of each unicode string instead:
for result in results:
    print repr(result)
u'\u010d\u0107\u0111\u0161' # what shows up in your list
u'Espa\xf1a'
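To check that nothing was lost in the match, here is a minimal sketch that compares the escaped and literal spellings. It assumes the source file is saved as UTF-8, and the `__future__` import is only there so the same lines run on Python 2 and Python 3 (my assumption; the question uses Python 2). The names `same_text`, `lengths` and `readable` are purely illustrative.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals  # assumption: lets the sketch run on Python 2 and 3
import re

htmlstring = '/">čćđš</a>.../">España</a>'
matches = re.findall(r'/">(.*?)</a', htmlstring, re.U)

# The escaped and the literal spellings compare equal: same code points.
same_text = (matches == ['čćđš', 'España'])

# Each \u or \x escape stands for exactly one character.
lengths = [len(m) for m in matches]

# Join the elements to get one readable string instead of a list repr.
readable = ', '.join(matches)
```

Printing `readable` on a terminal that can display these characters should show the text with its diacritics rather than the escapes.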
Incidentally, it appears that you are trying to parse HTML with regexes. You should try BeautifulSoup or something similar instead. It will save you a major headache down the road.
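As a rough illustration of the parser approach, here is a sketch that uses the standard library's `HTMLParser` as a stand-in for "something similar"; it is not BeautifulSoup itself. The sample markup is hypothetical, since the HTML in the question is truncated, and the class name `LinkTextParser` is made up for this example.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals  # assumption: lets the sketch run on Python 2 and 3

try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2


class LinkTextParser(HTMLParser):
    """Collect the text found inside each <a>...</a> element."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_link = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_link = True
            self.texts.append('')

    def handle_endtag(self, tag):
        if tag == 'a':
            self.in_link = False

    def handle_data(self, data):
        # Only keep text that appears while we are inside a link.
        if self.in_link:
            self.texts[-1] += data


# Hypothetical markup; the hrefs are invented for the example.
sample = '<a href="/one/">čćđš</a> ... <a href="/two/">España</a>'
parser = LinkTextParser()
parser.feed(sample)
parser.close()
link_texts = parser.texts
```

Here `link_texts` holds the same two unicode strings the regex found, but the parser does not depend on the exact characters that happen to precede each link.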
Upvotes: 1