Regular Expression with Umlauts in Python 2.7.5

Question

Why doesn't my regular expression work? I need to use Python 2.7.5. This is my expression:

pattern = re.compile('\d{4};[a-zA-ZäöüÄÖÜß. -]+;.+')

I'm reading a CSV file. Each line must start with 4 digits followed by a ;. Up to the 2nd ; there must be only letters a-z and A-Z, umlauts, a ., a space or a -. After the 2nd ; any character is allowed.

Now my problem: in the second "part" it doesn't accept umlauts like äöü or ß. In the third "part", where I don't specify the umlauts, it's no problem when they occur.

I did put # -*- coding: utf-8 -*- at the beginning of the script.

Martijn Pieters · Accepted Answer

Because your source is encoded as UTF-8, each umlaut in your byte-string pattern is a multibyte sequence inside the character class:

>>> 'ä'
'\xc3\xa4'

In UTF-8, anything outside the ASCII character range requires more than one byte to encode.

Your character class now matches the individual bytes 0xC3 and 0xA4 (plus the bytes of the other umlauts), not the character 'ä' as a unit. It may match 'ä', but it could also match any other byte sequence made up of the C3 or A4 bytes.
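To make the byte-level behaviour concrete, here is a minimal sketch. It uses a reduced class containing only 'ä' rather than the full class from the question, and the variable names are made up for illustration. It is written with explicit byte strings so it behaves the same on Python 2.7 and Python 3.

```python
import re

# Hypothetical illustration: the class [ä] as a byte-string pattern.
# u'\u00e4' is 'ä'; encoded as UTF-8 it is the two bytes C3 A4.
a_umlaut = u'\u00e4'.encode('utf-8')   # b'\xc3\xa4'
o_umlaut = u'\u00f6'.encode('utf-8')   # b'\xc3\xb6' ('ö')

# The class holds two separate bytes, not one character.
byte_class = b'[' + a_umlaut + b']'

# Exactly one class member cannot cover the two-byte 'ä'.
whole_char = re.match(b'^' + byte_class + b'$', a_umlaut)

# A single class member does match the first byte of 'ö'.
half_char = re.match(byte_class, o_umlaut)

# A run of class members also matches bytes that are not valid UTF-8.
garbage = re.match(b'^' + byte_class + b'+$', b'\xa4\xa4')
```

Here whole_char is None, while half_char and garbage are both match objects, so the class neither requires a complete 'ä' nor rejects byte sequences that are not 'ä' at all.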

You'd either have to explicitly match each UTF-8 byte pair (a real pain), or decode your data to Unicode strings first and use a Unicode regular expression:

re.compile(u'\d{4};[a-zA-ZäöüÄÖÜß. -]+;.+', flags=re.UNICODE)

Make sure you pass in Unicode text, not byte strings, when you use that regular expression.
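A minimal sketch of the decode-then-match approach follows. The helper name line_matches and the sample lines are hypothetical, and the UTF-8 file encoding is an assumption, since the question does not say how the CSV file is encoded.

```python
# -*- coding: utf-8 -*-
import re

# Unicode pattern: each umlaut in the class is one code point.
pattern = re.compile(u'\\d{4};[a-zA-ZäöüÄÖÜß. -]+;.+', flags=re.UNICODE)


def line_matches(raw_line, encoding='utf-8'):
    """Decode one raw CSV line and test it against the pattern.

    The encoding is an assumption; use whatever the file really uses.
    """
    text = raw_line.decode(encoding)
    return pattern.match(text) is not None


# Sample lines, encoded here to stand in for bytes read from the file.
good = u'1234;Müller-Lüdenscheidt;anything, even ä'.encode('utf-8')
bad = u'1234;M¤ller;x'.encode('utf-8')
```

With these samples, line_matches(good) returns True and line_matches(bad) returns False, because '¤' is not in the character class. If the file turns out to be Latin-1, the same helper could be called with encoding='latin-1'.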
