Reputation: 21721
I have this block of code (Django):
list_temp = []
tagname_re = re.compile(r'^[\w+\.-]+$', re.UNICODE)
for key,tag in list.items():
    if len(tag) > settings.FORM_MAX_LENGTH_OF_TAG or len(tag) < settings.FORM_MIN_LENGTH_OF_TAG:
        raise forms.ValidationError(_('please use between %(min)s and %(max)s characters in you tags') % { 'min': settings.FORM_MIN_LENGTH_OF_TAG, 'max': settings.FORM_MAX_LENGTH_OF_TAG})
    if not tagname_re.match(tag):
        raise forms.ValidationError(_('please use following characters in tags: letters , numbers, and characters \'.-_\''))
    # only keep one same tag
    if tag not in list_temp and len(tag.strip()) > 0:
        list_temp.append(tag)
This allows tag names to contain Unicode characters.
But it does not work with my script (Khmer; Khmer Symbols range: 19E0–19FF, The Unicode Standard, Version 4.0), and I don't know why.
My question:
How can I change the line
tagname_re = re.compile(r'^[\w+\.-]+$', re.UNICODE)
so that it accepts my Unicode characters? If I enter the tag "នយោបាយ", I get this message:
please use following characters in tags: letters , numbers, and characters \'.-_\''
Upvotes: 6
Views: 8424
Reputation: 20654
Have a look at the new regex implementation on PyPI:
http://pypi.python.org/pypi/regex
With Python 3, both Khmer vowel signs match \w:
>>> import regex
>>> regex.match("\w", "\u17C4")
<_regex.Match object at 0x00F03988>
>>> regex.match("\w", "\u17B6")
<_regex.Match object at 0x00F03D08>
Upvotes: 4
Reputation: 467231
bobince's answer is certainly correct. However, before you hit that problem there might be another one: is tag definitely a unicode rather than a str? For example:
>>> str_version = 'នយោបាយ'
>>> type(str_version)
<type 'str'>
>>> print str_version
នយោបាយ
>>> unicode_version = 'នយោបាយ'.decode('utf-8')
>>> type(unicode_version)
<type 'unicode'>
>>> print unicode_version
នយោបាយ
>>> r = re.compile(r'^(\w+)',re.U)
>>> r.search(str_version).group(1)
'\xe1'
>>> print r.search(str_version).group(1)
>>> r.search(unicode_version).group(1)
u'\u1793\u1799'
>>> print r.search(unicode_version).group(1)
នយ
As another small point, in your regular expression the + in the character class just means that a literal + is also allowed in the sequence of letters and punctuation.
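To illustrate the point about the + inside the character class, here is a small sketch (written for Python 3; the name tagname_no_plus_re is made up for this example) comparing the original pattern with one that drops the +:

```python
import re

# Pattern from the question: the + inside [...] is a literal plus sign,
# not a quantifier.
tagname_re = re.compile(r'^[\w+\.-]+$', re.UNICODE)

# Hypothetical variant without the + in the class, for comparison only.
tagname_no_plus_re = re.compile(r'^[\w.-]+$', re.UNICODE)

with_plus = bool(tagname_re.match('c++'))             # literal + is accepted
without_plus = bool(tagname_no_plus_re.match('c++'))  # literal + is rejected

print(with_plus, without_plus)
```

If tags such as "c++" should stay valid, the + has to remain in the class; if it was meant as a quantifier, it can be removed.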
Upvotes: 5
Reputation: 536389
ោ (U+17C4 KHMER VOWEL SIGN OO) and ា (U+17B6 KHMER VOWEL SIGN AA) are not letters, they're combining marks, so they don't match \w.
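One possible workaround, sketched here for Python 3 with only the standard library, is to check each character's Unicode general category and accept marks (M*) alongside letters (L*) and numbers (N*). The helper names char_categories and is_valid_tag and the set EXTRA_TAG_CHARS are invented for this example and are not part of Django or the original code; the set of accepted categories is an assumption and is not exactly what \w matches.

```python
import unicodedata

# Punctuation accepted by the original pattern [\w+\.-] (the underscore
# is covered by \w there). Hypothetical name, for illustration only.
EXTRA_TAG_CHARS = set('_+.-')


def char_categories(text):
    """Return (code point, general category) pairs for each character."""
    return [('U+%04X' % ord(ch), unicodedata.category(ch)) for ch in text]


def is_valid_tag(tag):
    """Accept letters (L*), combining marks (M*), numbers (N*) and the
    extra punctuation above. An assumption about what a tag may contain."""
    if not tag:
        return False
    for ch in tag:
        if ch in EXTRA_TAG_CHARS:
            continue
        if unicodedata.category(ch)[0] not in ('L', 'M', 'N'):
            return False
    return True


# The two vowel signs show up with category Mc (spacing combining mark),
# the other characters with Lo (other letter).
print(char_categories('នយោបាយ'))
print(is_valid_tag('នយោបាយ'))
```

In the form code, a check like this could stand in for tagname_re.match(tag); whether accepting every mark character is acceptable for your tags is a judgment call.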
Upvotes: 6