Alex

Reputation: 3257

How to cope with diacritics while trying to match with regex in Python

I am trying to use a regular expression to extract text containing diacritics from an HTML string:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
htmlstring=u'''/">čćđš</a>.../">España</a>'''

print re.findall( r'/">(.*?)</a', htmlstring, re.U )

This produces:

[u'\u010d\u0107\u0111\u0161', u'Espa\xf1a']

Any help, please?

Upvotes: 1

Views: 202

Answers (1)

Justin O Barber

Reputation: 11591

This appears to be an encoding question, and your code is working as it should. Were you expecting something different? Strings prefixed with u are unicode strings. Inside them, \u followed by four hex digits and \x followed by two hex digits are both escape sequences that stand for a single character, so \xf1 and \u00f1 denote the same character, ñ. If you print each result (instead of looking at its repr, which is what printing a list shows), you will see that you already have the result you appear to be looking for:

results = [u'\u010d\u0107\u0111\u0161', u'Espa\xf1a']
for result in results:
    print result

čćđš
España

When you print the list itself, Python shows the repr of each element, which is why you see the escaped form of these unicode strings:

for result in results:
    print repr(result)

u'\u010d\u0107\u0111\u0161'        # what shows up in your list
u'Espa\xf1a'
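
To make the point about escapes concrete, here is a minimal sketch that reuses the pattern from the question. The names html_string, matches, same_char and joined are introduced only for illustration, and the snippet is written so that it should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import re

# Same fragment and pattern as in the question.
html_string = u'/">čćđš</a>.../">España</a>'
matches = re.findall(r'/">(.*?)</a', html_string, re.U)

# \xf1 and \u00f1 are two spellings of the same character (ñ, U+00F1),
# so this comparison is True.
same_char = (u'\xf1' == u'\u00f1')

# Joining the elements gives one unicode string; printing that string
# shows the characters themselves rather than the escaped repr of a list.
joined = u', '.join(matches)
```

Printing joined on a terminal that can display these characters should show the diacritics directly, because only the repr of a string uses the escaped form.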

Incidentally, it appears that you are trying to parse html with regexes. You should try BeautifulSoup or something similar instead. It will save you a major headache down the road.
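
As a rough illustration of the parser-based approach, the sketch below uses the standard library's html.parser module in place of BeautifulSoup, so it needs no third-party install. It assumes Python 3. The LinkTextCollector class and the sample_html markup are hypothetical, since the question only shows the tail end of each link.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect the text found inside every <a> element."""

    def __init__(self):
        super().__init__()
        self.in_link = False
        self.parts = []
        self.link_texts = []

    def handle_starttag(self, tag, attrs):
        # Start a fresh buffer whenever a link opens.
        if tag == "a":
            self.in_link = True
            self.parts = []

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self.in_link:
            self.parts.append(data)

    def handle_endtag(self, tag):
        # Store the accumulated text when the link closes.
        if tag == "a" and self.in_link:
            self.link_texts.append("".join(self.parts))
            self.in_link = False


# Hypothetical markup, made up for this example.
sample_html = '<a href="/hr/">čćđš</a>...<a href="/es/">España</a>'

collector = LinkTextCollector()
collector.feed(sample_html)
collector.close()
link_texts = collector.link_texts
```

With BeautifulSoup the same idea is usually shorter, but the principle is the same: let a parser find the elements and read their text, rather than matching tag fragments with a regex.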

Upvotes: 1
