LittleBobbyTables

Reputation: 4473

Why does this Python RegEx pipe not pick out both unicode ranges?

A sample string containing both hiragana and katakana unicode characters:

myString = u"Eliminate ひらがな non-alphabetic カタカナ characters"

A pattern to match both ranges, according to: http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml

myPattern = u"[\u3041-\u309f]*|[\u30a0-\u30ff]*"

A simple Python regex replace:

import re
print re.sub(myPattern, "", myString)

Returns:

Eliminate  non-alphabetic カタカナ characters

The only way I can get it to work is if I use the two ranges separately, one after the other. What is stopping this RegEx from simply picking both sides of the |-pipe?

Upvotes: 1

Views: 152

Answers (2)

bpgergo

Reputation: 16037

>>> myPattern = u"[\u3041-\u309f]|[\u30a0-\u30ff]"
>>> print re.sub(myPattern, "", myString)
Eliminate  non-alphabetic  characters
>>> 

EDIT: you can also combine the two character classes with the OR operator, as in the pattern above, which drops the `*` quantifiers.
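Without `*`, neither alternative can match the empty string, so at each position the engine tries the hiragana class and, if that fails, falls through to the katakana class. A small sketch of that behaviour, written for Python 3 rather than the Python 2 used in the question (the variable names are illustrative and not from the original post):

```python
import re

my_string = "Eliminate ひらがな non-alphabetic カタカナ characters"

# No quantifier: each alternative must consume exactly one character.
per_char = "[\u3041-\u309f]|[\u30a0-\u30ff]"

# Every match is a single kana character, taken from either range.
matches = re.findall(per_char, my_string)
print(matches)

# Removing those single-character matches strips both scripts.
cleaned = re.sub(per_char, "", my_string)
print(cleaned)
```

The trade-off is that this replaces one character at a time instead of a whole run, which is usually fine for short strings.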

Upvotes: 0

Martijn Pieters

Reputation: 1123420

You'll need to combine the ranges into one character class. With `*` on each side of the `|`, the first alternative can match the empty string, so it always succeeds and the second range is never tried:

myPattern = u"[\u3041-\u309f\u30a0-\u30ff]*"

Demo:

>>> myPattern = u"[\u3041-\u309f\u30a0-\u30ff]*"
>>> print re.sub(myPattern, "", u"Eliminate ひらがな non-alphabetic カタカナ characters")
Eliminate  non-alphabetic  characters
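To make the cause visible, the sketch below (Python 3, not the Python 2 of the question) first shows that the original pattern produces an empty match at a katakana character, then wraps the combined class in a helper. The helper name `strip_kana` is hypothetical, and using `+` instead of `*` is a swapped-in choice so the pattern never matches the empty string. Since U+309F and U+30A0 are adjacent code points, the two ranges could also be written as the single range `[\u3041-\u30ff]`.

```python
import re

# Pattern from the question: both alternatives are allowed to match nothing.
original = "[\u3041-\u309f]*|[\u30a0-\u30ff]*"

# At a katakana character the left alternative matches zero hiragana
# characters and wins, so the right alternative is never attempted.
empty = re.match(original, "カタカナ").group()
print(repr(empty))

# Combined class; "+" requires at least one kana character per match.
KANA = re.compile("[\u3041-\u309f\u30a0-\u30ff]+")


def strip_kana(text):
    """Remove runs of hiragana and katakana (hypothetical helper)."""
    return KANA.sub("", text)


result = strip_kana("Eliminate ひらがな non-alphabetic カタカナ characters")
print(result)
```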

Upvotes: 6
