Python + Regex + UTF-8 doesn't recognize accents

Question

My problem is that Python, using regex and re.search(), doesn't recognize accented characters even though I use UTF-8. Here is my code:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import re

htmlString = '</dd><dt> Fine, thank you.&#160;</dt><dd> Molt bé, gràcies.'

SearchStr = '(\<\/dd\>\<dt\>)+ ([\w+\,\.\s]+)([\&\#\d\;]+)(\<\/dt\>\<dd\>)+ (\w+) (\w+)'

Result = re.search(SearchStr, htmlString)

if Result:
    print Result.groups()

passavol23:jO$ catalanword.py
('</dd><dt>', 'Fine, thank you.', '&#160;', '</dt><dd>', 'Molt', 'b')

So the problem is that it doesn't recognize the é and stops there. Any help would be appreciated. I'm a Python beginner.

Martijn Pieters · Accepted Answer

By default, \w matches only ASCII characters when applied to a Python 2 byte string; it is equivalent to [a-zA-Z0-9_]. Matching UTF-8 bytes with regular expressions is hard enough, let alone matching only word characters: you would have to match byte ranges instead.

You'll need to decode from UTF-8 to unicode and use the re.UNICODE flag instead:

>>> re.search(SearchStr, htmlString.decode('utf8'), re.UNICODE).groups()
(u'</dd><dt>', u'Fine, thank you.', u'&#160;', u'</dt><dd>', u'Molt', u'b\xe9')
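
The same effect can be reproduced in isolation. The following is a minimal sketch that assumes Python 3 rather than the Python 2 used in the question: there, a bytes pattern behaves like the byte-string case above (\w is ASCII-only), while a text pattern is Unicode-aware by default, so re.UNICODE is not needed. The sample sentence is taken from the question; the variable names are only illustrative.

```python
import re

# UTF-8 encoded bytes, comparable to a plain Python 2 str (assumption: Python 3).
raw = "Molt bé, gràcies.".encode("utf-8")

# Bytes pattern: \w is ASCII-only, so every match stops at the first
# byte of an accented character.
ascii_words = re.findall(rb"\w+", raw)

# Decode first, then use a text pattern: \w now matches accented letters.
unicode_words = re.findall(r"\w+", raw.decode("utf-8"))

print(ascii_words)    # [b'Molt', b'b', b'gr', b'cies']
print(unicode_words)  # ['Molt', 'bé', 'gràcies']
```

The bytes result shows the truncation from the question ("b" instead of "bé"); decoding before matching removes it.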

However, you should really be using an HTML parser to deal with HTML instead. Use BeautifulSoup, for example. It'll handle encoding and Unicode correctly for you.
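
To illustrate the parser approach without an extra dependency, here is a rough sketch that swaps in the standard library's html.parser for BeautifulSoup and assumes Python 3. The class name DefinitionListParser and the sample markup are hypothetical, chosen only to resemble the phrase pairs in the question; this is not the method from the answer itself.

```python
from html.parser import HTMLParser


class DefinitionListParser(HTMLParser):
    """Collect the text of each <dt> and <dd> element (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &#160; into characters.
        super().__init__(convert_charrefs=True)
        self.items = []       # (tag, text) pairs in document order
        self._current = None  # tag currently being read, or None
        self._buffer = []     # text chunks seen inside the current tag

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._current = tag
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == self._current:
            # strip() also removes the non-breaking space produced by &#160;.
            self.items.append((tag, "".join(self._buffer).strip()))
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)


# Assumed sample markup, not the original page.
parser = DefinitionListParser()
parser.feed("<dl><dt>Fine, thank you.&#160;</dt><dd>Molt bé, gràcies.</dd></dl>")
parser.close()

print(parser.items)
# [('dt', 'Fine, thank you.'), ('dd', 'Molt bé, gràcies.')]
```

Because the parser works on decoded text and handles entities itself, no character classes for accents or for &#160; are needed.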
