Reputation: 4473
A sample string containing both hiragana and katakana unicode characters:
myString = u"Eliminate ひらがな non-alphabetic カタカナ characters"
A pattern to match both ranges, according to: http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml
myPattern = u"[\u3041-\u309f]*|[\u30a0-\u30ff]*"
A simple Python regex replace call:
import re
print re.sub(myPattern, "", myString)
Returns:
Eliminate non-alphabetic カタカナ characters
The only way I can get it to work is to apply the two ranges separately, one after the other. What is stopping this regex from matching both sides of the | pipe?
Upvotes: 1
Views: 152
Reputation: 16037
>>> myPattern = u"[\u3041-\u309f]|[\u30a0-\u30ff]"
>>> print re.sub(myPattern, "", myString)
Eliminate non-alphabetic characters
>>>
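EDIT: you can also combine the two character classes with the | (OR) operator, as long as neither alternative can match the empty string (which is why the * quantifiers are dropped here).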
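If you would rather remove each run of kana in one substitution instead of character by character, a + quantifier on each side should work too, since + cannot match the empty string. This is only a sketch in Python 3 syntax; the names my_string, alternation and result are mine, not from the question.

```python
import re

my_string = u"Eliminate ひらがな non-alphabetic カタカナ characters"

# Each alternative needs at least one character, so neither side can
# succeed with a zero-length match and shadow the other.
alternation = u"[\u3041-\u309f]+|[\u30a0-\u30ff]+"

result = re.sub(alternation, u"", my_string)
print(result)
```

Note that the spaces on either side of each removed run are left in place, so the result contains doubled spaces unless you normalise the whitespace afterwards.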
Upvotes: 0
Reputation: 1123420
You'll need to combine the ranges into one character class. Otherwise the first alternative, which can match the empty string because of the *, succeeds at every position, so the second alternative is never tried:
myPattern = u"[\u3041-\u309f\u30a0-\u30ff]*"
Demo:
>>> myPattern = u"[\u3041-\u309f\u30a0-\u30ff]*"
>>> print re.sub(myPattern, "", u"Eliminate ひらがな non-alphabetic カタカナ characters")
Eliminate non-alphabetic characters
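To see why the pattern from the question leaves the katakana alone, you can try both patterns directly at the position of the first katakana character. This is a minimal sketch in Python 3 syntax; the variable names (original, combined, kata_pos and so on) are my own and only there for illustration.

```python
import re

my_string = u"Eliminate ひらがな non-alphabetic カタカナ characters"

original = u"[\u3041-\u309f]*|[\u30a0-\u30ff]*"   # pattern from the question
combined = u"[\u3041-\u309f\u30a0-\u30ff]*"       # single character class

# Index of the first katakana character in the sample string.
kata_pos = my_string.index(u"カ")

# Alternatives are tried left to right. The hiragana alternative succeeds
# here with a zero-length match, so the katakana alternative is never used.
m_original = re.compile(original).match(my_string, kata_pos)

# With one character class, the same position matches the whole katakana run.
m_combined = re.compile(combined).match(my_string, kata_pos)

print(repr(m_original.group()))
print(repr(m_combined.group()))
```

The first match is empty and the second covers the full run of katakana, which is why re.sub with the original pattern has nothing to remove there.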
Upvotes: 6