Fabrizio Silvestri

Reputation: 113

Python Regex - Matching mixed Unicode and ASCII characters in a string

I've tried several different approaches and none of them work.

Suppose I have a string s defined as follows:

s = '[မန္း],[aa]'.decode('utf-8')

Suppose I want to parse the two strings within the square brackets. I've compiled the following regex:

pattern = re.compile(r'\[(\w+)\]', re.UNICODE)

and then I look for occurrences using:

pattern.findall(s, re.UNICODE)

The result is just [] instead of the expected list of two matches. Furthermore, if I remove re.UNICODE from the findall call, I get the single string [u'aa'], i.e. only the ASCII one:

pattern.findall(s)

Of course

s = '[bb],[aa]'.decode('utf-8')
pattern.findall(s)

returns [u'bb', u'aa']

And to make things even more interesting:

s = '[မနbb],[aa]'.decode('utf-8')
pattern.findall(s)

returns [u'\u1019\u1014bb', u'aa']

Upvotes: 0

Views: 3157

Answers (2)

Mark Tolonen

Reputation: 177481

First, note that the following only works in Python 2.x if you save the source file in UTF-8 and declare the source encoding at the top of the file; otherwise, the source is assumed to be ASCII:

#coding: utf8
s = '[မန္း],[aa]'.decode('utf-8')

A shorter way to write it is to code a Unicode string directly:

#coding: utf8
s = u'[မန္း],[aa]'
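As a rough check that the two forms are equivalent, the sketch below spells out both the UTF-8 bytes and the code points with escapes, so it does not depend on how the source file is saved. The variable names are illustrative, and the b'' and u'' prefixes are used so that the same lines should also run on Python 3.

```python
# The four Myanmar characters written as raw UTF-8 bytes (three bytes each).
utf8_bytes = b'[\xe1\x80\x99\xe1\x80\x94\xe1\x80\xb9\xe1\x80\xb8],[aa]'

# The same text written directly as a Unicode literal with code point escapes.
unicode_text = u'[\u1019\u1014\u1039\u1038],[aa]'

# Decoding the bytes yields the same Unicode string as the literal.
decoded = utf8_bytes.decode('utf-8')
print(decoded == unicode_text)
print(len(decoded))  # counts code points, not bytes
```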

Next, \w matches alphanumeric characters. With the re.UNICODE flag it matches characters that are categorized as alphanumeric in the Unicode database. Not all of the characters in မန္း are alphanumeric: the last two code points (U+1039 and U+1038) are combining marks rather than letters, so \w+ stops before the closing bracket and the match fails. If you want whatever is between the brackets, use something like the following. Note the use of .*? for a non-greedy match of everything. It's also a good habit to use Unicode strings for all text, and raw strings for regular expressions in particular.

#coding:utf8
import re
s = u'[မန္း],[aa],[မနbb]'
pattern = re.compile(ur'\[(.*?)\]')
print re.findall(pattern,s)

Output:

[u'\u1019\u1014\u1039\u1038', u'aa', u'\u1019\u1014bb']

Note that when it prints a list, Python 2 shows each string's repr, an unambiguous form with escape codes for non-ASCII and non-printable characters.

To see the actual string content, print the strings, not the list:

for item in re.findall(pattern,s):
    print item

Output:

မန္း
aa
မနbb
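To see which characters \w rejects, one option is to look up each code point's general category with the standard unicodedata module. This is only a sketch, assuming the bracketed text consists of the four code points shown in the list output above; the names word, word_char and report are made up for illustration.

```python
import re
import unicodedata

word = u'\u1019\u1014\u1039\u1038'  # the text inside the first pair of brackets
word_char = re.compile(r'\w', re.UNICODE)

report = []
for ch in word:
    category = unicodedata.category(ch)   # two-letter Unicode general category
    is_word = bool(word_char.match(ch))   # does \w accept this character?
    report.append((u'U+%04X' % ord(ch), category, is_word))

for row in report:
    print(row)
```

The first two code points should come out as letters that \w accepts and the last two as combining marks that it rejects, which is why \w+ cannot run all the way to the closing bracket.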

Upvotes: 0

JohanL

Reputation: 6891

It's actually rather simple. \w matches all alphanumeric characters and not all of the characters in your initial string are alphanumeric.

If you still want to match all characters between the brackets, one solution is to match everything except a closing bracket (]). This can be done as follows:

import re
s = '[မန္း],[aa]'.decode('utf-8')
pattern = re.compile(r'\[([^]]+)\]', re.UNICODE)
re.findall(pattern, s)

where [^]] is a negated character class: it matches any single character except those listed after the circumflex (^), which here is only the closing bracket.
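The negated class and the non-greedy \[(.*?)\] from the other answer behave differently at the edges, which may matter depending on the input. The sketch below uses a made-up sample string (not from the question) containing an empty pair of brackets and a pair with a newline inside.

```python
import re

# Hypothetical sample: an empty pair, a pair containing a newline, a plain pair.
sample = u'[],[a\nb],[aa]'

lazy = re.findall(r'\[(.*?)\]', sample)       # '.' does not match '\n' by default
negated = re.findall(r'\[([^]]+)\]', sample)  # '+' needs at least one character

print(lazy)
print(negated)
```

Here the non-greedy form reports the empty pair but skips the one spanning a line break, while the negated class does the opposite. Using [^]]* or passing re.DOTALL would change that behavior.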

Also, note that the re.UNICODE argument to re.compile is not necessary here: the flag changes the meaning of escapes such as \w, \d, \s and \b, and this pattern uses none of them.
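A related detail from the question: the empty result of pattern.findall(s, re.UNICODE) is consistent with how compiled patterns work. Their findall method takes (string, pos, endpos), so the flag's integer value is read as a start position rather than as a flag. A minimal sketch, using the question's pattern and escaped code points for the string:

```python
import re

s = u'[\u1019\u1014\u1039\u1038],[aa]'
pattern = re.compile(r'\[(\w+)\]', re.UNICODE)

# On a compiled pattern the second positional argument is `pos`, not `flags`.
flag_value = int(re.UNICODE)
with_flag_as_pos = pattern.findall(s, re.UNICODE)  # search starts at index 32
from_start = pattern.findall(s, 0)                 # search starts at index 0

print(flag_value, len(s))
print(with_flag_as_pos)
print(from_start)
```

Since the string is shorter than the start position, nothing is found. Flags belong in re.compile, or in the module-level re.findall(pattern_string, s, flags).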

Upvotes: 0
