How can I write a regular expression that validates true only if all characters are Japanese/not ASCII?

Question

I have a list of terms on a Drupal site:

Art
美術
Sports
スポーツ

I need to filter these terms by regular expression and the expression needs to evaluate true only for the Japanese terms (美術, スポーツ).

The following conditions are true:

All English terms will only consist of upper and lower case letters.
All Japanese terms will only consist of Japanese characters (kanji and kana).

I have written a few regular expressions before but I have no idea how to handle Unicode. An expression like [a-zA-Z]* grabs all terms, including the Japanese ones.

jcomeau_ictx · Accepted Answer

Using the ranges here: http://en.wikipedia.org/wiki/Japanese_writing_system

>>> import re
>>> kanji = map(unichr, range(0x4e00, 0x9fbf + 1))
>>> katakana = map(unichr, range(0x30a0, 0x30ff + 1))
>>> hiragana = map(unichr, range(0x3040, 0x309f + 1))
>>> japanese = ''.join(kanji + katakana + hiragana)
>>> pattern = r'^[%s\s]+$' % japanese
>>> re.compile(pattern, re.U).match('スポーツ'.decode('utf8'))
<_sre.SRE_Match object at 0x9e6a090>
>>> re.compile(pattern, re.U).match('スポーツtest'.decode('utf8'))
>>>

This is Python 2, of course (note unichr and str.decode); hopefully you can adapt it to the language of your choice.

The key is probably the anchors ^ and $, which make sure you're matching the entire string. [a-zA-Z]* matches every term because '*' means "0 or more", so it can match the empty string at the start of a Japanese term. Also make sure to decode any input strings: if the input is still UTF-8 encoded bytes, it won't match. The 'U' flag is not really necessary in this case because you're not asking the regex engine to decide \w, \W, \b, \B for you.
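To make the "0 or more" point concrete, here is a small sketch in Python 3 (not the Python 2 session above). The list `terms` is just the four terms from the question, and the variable names are made up for illustration.

```python
import re

terms = ["Art", "美術", "Sports", "スポーツ"]

# Unanchored with '*': the pattern may match zero characters at the start
# of the string, so every term produces a (possibly empty) match.
unanchored = [t for t in terms if re.match(r"[a-zA-Z]*", t)]

# Anchored with '+': the whole term must consist of one or more ASCII letters.
anchored = [t for t in terms if re.match(r"^[a-zA-Z]+$", t)]

print(unanchored)  # ['Art', '美術', 'Sports', 'スポーツ']
print(anchored)    # ['Art', 'Sports']
```

The same anchoring idea is what makes the Japanese-only pattern reject mixed input such as 'スポーツtest'.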

Also, after re-reading your question, I see you don't expect spaces in the input, so you could drop the '\s' from the regex.
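For what it's worth, a rough Python 3 equivalent without the '\s' might look like the sketch below. It assumes the same three code point ranges as above, written directly as character-class ranges instead of a joined string; `JAPANESE_ONLY` and `is_japanese` are hypothetical names, and `fullmatch` stands in for the explicit ^ and $ anchors.

```python
import re

# Same code point ranges as in the session above, as character-class ranges.
JAPANESE_ONLY = re.compile(
    "["
    "\u4e00-\u9fbf"  # kanji range used above
    "\u30a0-\u30ff"  # katakana
    "\u3040-\u309f"  # hiragana
    "]+"
)

def is_japanese(term: str) -> bool:
    """Return True only if every character of term falls in the ranges above."""
    # fullmatch requires the whole string to match, like ^...$ anchoring.
    return JAPANESE_ONLY.fullmatch(term) is not None

terms = ["Art", "美術", "Sports", "スポーツ"]
japanese_terms = [t for t in terms if is_japanese(t)]
print(japanese_terms)  # ['美術', 'スポーツ']
```

In Python 3, str is already Unicode, so the decode step only applies if the terms arrive as bytes; in that case something like `is_japanese(raw.decode("utf-8"))` would be needed first, since a str pattern cannot be applied to a bytes object.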
