Why does this Python RegEx pipe not pick out both unicode ranges?

Question

A sample string containing both hiragana and katakana unicode characters:

myString = u"Eliminate ひらがな non-alphabetic カタカナ characters"

A pattern to match both ranges, according to: http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml

myPattern = u"[\u3041-\u309f]*|[\u30a0-\u30ff]*"

A simple Python regex replacement:

import re
print re.sub(myPattern, "", myString)

Returns:

Eliminate  non-alphabetic カタカナ characters

The only way I can get it to work is to apply the two ranges separately, one after the other. What is stopping this RegEx from simply matching both sides of the | pipe?

Martijn Pieters · Accepted Answer

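You'll need to combine the ranges into one character class. With the alternation, the left-hand alternative [\u3041-\u309f]* can match the empty string, so it succeeds at every position and the right-hand katakana alternative is never tried: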

myPattern = u"[\u3041-\u309f\u30a0-\u30ff]*"

Demo:

>>> myPattern = u"[\u3041-\u309f\u30a0-\u30ff]*"
>>> print re.sub(myPattern, "", u"Eliminate ひらがな non-alphabetic カタカナ characters")
Eliminate  non-alphabetic  characters
