Reputation: 10142
Regular expressions can search for character categories, like \s for "all spaces". This does not work for Unicode, though. For example, in Japanese there are two extra space characters that \s cannot match (half-width and double-width spaces).
Regex engines deal with such a case by supporting Unicode properties, which are basically extended character categories. Unfortunately, each engine has to implement these properties itself, so support depends on the programming language.
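To make the idea of a Unicode property concrete, here is a minimal sketch that uses the standard unicodedata module to look up the general category of a few space-like characters. The `samples` mapping and the `general_category` helper are names made up for this illustration; the category code Zs ("space separator") is the one that a pattern such as \p{Zs} refers to.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import unicodedata

# A few space-like characters, written as escapes so the file encoding does not matter.
samples = {
    u"\u0020": "SPACE (ASCII)",
    u"\u00a0": "NO-BREAK SPACE",
    u"\u3000": "IDEOGRAPHIC SPACE (double-width)",
    u"\t": "TAB",
}


def general_category(char):
    """Return the two-letter Unicode general category of one character (illustrative helper)."""
    return unicodedata.category(char)


# Print code point, label, and category for each sample character.
for char, label in sorted(samples.items()):
    print("U+%04X %-35s %s" % (ord(char), label, general_category(char)))
```

The three space characters should all report the same category, while the tab falls into a control category, which is why a category-based pattern and \s do not cover exactly the same set.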
This question is about Python, specifically the standard re module and the regex module, a candidate drop-in replacement for re that supports some Unicode properties. In fact the problem exists with Python 2.7 (tested with 2.7.14).
Here is a review with Python:
#coding: utf-8
import re
import regex
blank_re = re.compile('\s', re.UNICODE)
blank_regex = regex.compile('\p{Zs}', re.UNICODE)
cases = {
"NoSpaceHere": 0,
"Space Here": 1,
"何かスペースがない": 0,
"今回 はね,スペースがある": 1, # Single-width space (ASCII)
"今回 もね,スペースあり": 1 # Double-width space (UTF-8, Japanese)
}
for case in cases.keys():
    res = blank_re.findall(case)
    if len(res) != cases[case]:
        print("[ re ] Failure on %s" % case)
    res = blank_regex.findall(case)
    if len(res) != cases[case]:
        print("[ regex ] Failure on %s" % case)
Running this script with Python 2.7:
> python test_regex.py
[ re ] Failure on 今回 もね,スペースあり
[ regex ] Failure on 今回 もね,スペースあり
The script fails on the double-width space character, located just after the 回 character. Note that it fails with re because that module does not support Unicode properties, and with regex for a different reason (perhaps no support for the space property).
Running the same script with Python 3.6.5 prints nothing, meaning all test cases pass.
Is there any way under Python 2.7 to match all spaces in the corresponding Unicode property?
Why just spaces? It turns out that SO has had questions on similar issues for a long time already. Here is a question just for quotation marks, another one on spaces with Bash, with samples in Perl and Python 3, and a generic question about Unicode property matching in Python. There are many Unicode properties to support, and implementations are usually partial (except Perl, apparently). Spaces are a pervasive class, so they are the most likely to be implemented...
Note I have tried several scenarios before settling on the above review script, which seems readable enough... Earlier attempts include using \s with regex, removing re.UNICODE, including the double-width space itself in the regex, etc.
Upvotes: 1
Views: 807
Reputation: 79
Python 2.7 requires a bit more effort to handle Unicode because its default string type (str) is a byte string, whereas in Python 3 all strings are Unicode. In 2.7, a non-ASCII character in a plain string literal is stored as the raw bytes found in the source file (UTF-8 here), so the double-width space ends up as the three bytes \xe3\x80\x80 rather than as a single character.
The encoding declaration at the top only tells Python how to decode the source file. It does not change how strings are represented in memory.
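As a small sketch of the difference between a character and its encoded bytes, the snippet below encodes the double-width space and compares lengths. The variable names are chosen for this illustration only; the code is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

wide_space = u"\u3000"                # one Unicode character (double-width space)
encoded = wide_space.encode("utf-8")  # the same character as a UTF-8 byte string

print(len(wide_space))  # number of characters
print(len(encoded))     # number of bytes
```

A pattern that works character by character only sees the space as a space in the first form; in the second form it sees three unrelated bytes.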
Once the case strings are Unicode strings, the regex will work: a Unicode regex expects Unicode input. String literals can be converted by prefixing them with u, and byte-string variables by calling .decode('utf-8'):
cases = {
u"NoSpaceHere": 0,
u"Space Here": 1,
u"何かスペースがない": 0,
u"今回 はね,スペースがある": 1, # Single-width space (ASCII)
u"今回 もね,スペースあり": 1 # Double-width space (UTF-8, Japanese)
}
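For strings that arrive as bytes at run time (read from a file or a socket, say), the same idea can be sketched as follows. This assumes the bytes are UTF-8; `count_spaces` is a hypothetical helper, not part of re or regex, and the code is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import re

# Unicode pattern; re.UNICODE makes \s Unicode-aware on Python 2.
blank_re = re.compile(u"\\s", re.UNICODE)


def count_spaces(value, encoding="utf-8"):
    """Count whitespace characters, decoding byte strings first (hypothetical helper)."""
    if isinstance(value, bytes):  # `bytes` is an alias of `str` on Python 2
        value = value.decode(encoding)
    return len(blank_re.findall(value))


# UTF-8 bytes for the first five characters of the failing test case,
# including the double-width space U+3000 (bytes e3 80 80).
raw = b"\xe4\xbb\x8a\xe5\x9b\x9e\xe3\x80\x80\xe3\x82\x82\xe3\x81\xad"

print(count_spaces(raw))
print(count_spaces(u"Space Here"))
```

Decoding at the boundary and working with Unicode strings everywhere else keeps the pattern itself unchanged between Python 2.7 and Python 3.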
It's easiest to see the difference by printing out the representation of each string.
print repr("今回 もね,スペースあり")
'\xe4\xbb\x8a\xe5\x9b\x9e\xe3\x80\x80\xe3\x82\x82\xe3\x81\xad\xef\xbc\x8c\xe3\x82\xb9\xe3\x83\x9a\xe3\x83\xbc\xe3\x82\xb9\xe3\x81\x82\xe3\x82\x8a'
print repr(u"今回 もね,スペースあり")
u'\u4eca\u56de\u3000\u3082\u306d\uff0c\u30b9\u30da\u30fc\u30b9\u3042\u308a'
Upvotes: 1