Reputation: 139
Why doesn't my regular expression work? I need to use Python 2.7.5. This is my expression:
pattern = re.compile('\d{4};[a-zA-ZäöüÄÖÜß. -]+;.+')
I'm reading a CSV file. Each line must start with 4 digits followed by a ; and then, up to the 2nd ; there must be only letters (a-z, A-Z), umlauts, ß, a space, a . or a - and after the 2nd ; there can be any character.
Now my problem: in the second "part" it doesn't accept umlauts like äöü or ß. In the third "part", where I don't specify the umlauts, it's no problem when they occur.
I did put # -*- coding: utf-8 -*- at the beginning of the script.
Upvotes: 0
Views: 870
Reputation: 1123790
By encoding to UTF-8, you entered a multibyte sequence into a character class:
>>> 'ä'
'\xc3\xa4'
In UTF-8, anything outside the ASCII character range requires more than one byte to encode.
Your character class therefore matches individual bytes, such as the 0xC3 byte or the 0xA4 byte, rather than whole characters. Together with the other bytes in the class it may happen to match 'ä', but it will also match any other UTF-8 byte sequence that contains the C3 or A4 bytes.
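To make the byte-level behaviour concrete, here is a minimal sketch. The names `byte_class`, `other` and `latin1` are made up for illustration, and the Latin-1 case is only an assumption about how a CSV file might be encoded, not something the question confirms.

```python
# -*- coding: utf-8 -*-
import re

# u'ä' encoded as UTF-8 is two bytes: 0xC3 0xA4.
encoded = u'ä'.encode('utf-8')

# This is what a byte-string class containing a UTF-8 'ä' amounts to.
byte_class = re.compile(b'[\xc3\xa4]')

# Each byte matches separately, so the class sees two "characters".
single_byte_hits = byte_class.findall(encoded)

# U+00C3 (Ã) encodes to 0xC3 0x83; its first byte matches the class too.
other = u'Ã'.encode('utf-8')
matches_unrelated = byte_class.match(other) is not None

# The same letter in Latin-1 is the single byte 0xE4, which is not in the class.
latin1 = u'ä'.encode('latin-1')
matches_latin1 = byte_class.match(latin1) is not None
```

So a byte-string class can both over-match (bytes that belong to other characters) and under-match (the same letter in a different encoding).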
You'd either have to explicitly match each UTF-8 byte pair (a real pain), or decode your data to Unicode strings first and use a Unicode regular expression:
re.compile(u'\d{4};[a-zA-ZäöüÄÖÜß. -]+;.+', flags=re.UNICODE)
Make sure you pass in Unicode text, not byte strings, when you use that regular expression.
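A rough sketch of the decode-then-match approach follows. The sample rows are invented stand-ins for lines read from the file, and UTF-8 is only an assumed encoding; substitute whatever encoding the CSV file really uses.

```python
# -*- coding: utf-8 -*-
import re

# Unicode pattern: each umlaut in the class is one character, not two bytes.
pattern = re.compile(u'\\d{4};[a-zA-ZäöüÄÖÜß. -]+;.+', flags=re.UNICODE)

# Hypothetical rows standing in for the raw bytes read from the CSV file.
raw_lines = [
    u'1234;Müller-Lüdenscheidt;Straße 5'.encode('utf-8'),
    u'12;too short;x'.encode('utf-8'),
]

# Assumption: the file is UTF-8. Decode before matching.
decoded_lines = [raw.decode('utf-8') for raw in raw_lines]

# True where the line fits the pattern, False otherwise.
results = [pattern.match(line) is not None for line in decoded_lines]
```

When reading a real file, io.open(path, encoding='utf-8') yields already-decoded Unicode lines, so the explicit decode step is not needed.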
Upvotes: 3