How to match all Unicode spaces with Python 2.7.x regular expressions?

Question

Regular expressions can search for character categories, like \s for "all spaces". This does not always work with Unicode text, though. For example, Japanese text uses a double-width (ideographic) space, U+3000, alongside the ordinary half-width one, and \s does not match it in my tests.

Regex engines deal with such cases by supporting Unicode properties, which are basically extended character categories. Unfortunately, support for these properties varies from one engine to another, and so from one programming language to another.
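To make "Unicode property" concrete, here is a minimal sketch that builds a character class for the Zs (space separator) general category by hand, using only the standard unicodedata and re modules. The names zs_chars and zs_re are illustrative, and the exact set of characters depends on the Unicode version bundled with the interpreter, so treat it as an approximation of \p{Zs} and not as a replacement for real property support.

```python
# -*- coding: utf-8 -*-
import re
import unicodedata

try:
    unichr  # exists on Python 2
except NameError:
    unichr = chr  # Python 3: chr already returns a unicode character

# Collect every code point in the Basic Multilingual Plane whose general
# category is Zs. The BMP is enough here: U+0020 and U+3000 both live in it.
zs_chars = [unichr(cp) for cp in range(0x10000)
            if unicodedata.category(unichr(cp)) == 'Zs']

# Hand-built character class, an approximation of \p{Zs} for the re module.
zs_re = re.compile(u'[' + re.escape(u''.join(zs_chars)) + u']')

print(zs_re.findall(u'今回\u3000もね') == [u'\u3000'])
```

Note that Zs is narrower than \s: it covers space characters only, so tabs and newlines are not part of it.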

This question is about Python, specifically the standard re module and the regex module, a candidate drop-in replacement for re that supports some Unicode properties. The problem shows up with Python 2.7 (tested with 2.7.14).

Here is a test script in Python:

#coding: utf-8

import re
import regex

blank_re    = re.compile(r'\s', re.UNICODE)
blank_regex = regex.compile(r'\p{Zs}', regex.UNICODE)

cases = {
    "NoSpaceHere": 0,
    "Space Here": 1,
    "何かスペースがない": 0,
    "今回 はね，スペースがある": 1, # Single-width space (ASCII, U+0020)
    "今回　もね，スペースあり": 1   # Double-width (ideographic) space, U+3000
}
for case in cases.keys():
    res = blank_re.findall(case)
    if len(res) != cases[case]:
        print("[  re   ] Failure on %s" % case)
    res = blank_regex.findall(case)
    if len(res) != cases[case]:
        print("[ regex ] Failure on %s" % case)

Running this script with Python 2.7:

> python test_regex.py
[  re   ] Failure on 今回　もね，スペースあり
[ regex ] Failure on 今回　もね，スペースあり

The script fails on the double-width space character, located just after the 回 character. My guess is that re fails because it does not support Unicode properties, and that regex fails for a different reason (perhaps it lacks support for the space property).

Running the same script with Python 3.6.5 prints nothing, meaning all test cases pass.

Is there any way under Python 2.7 to match all the spaces covered by the corresponding Unicode property?

Why just spaces? SO has had questions on similar issues for a long time. Here is a question just for quotation marks, another one on spaces with Bash, with samples in Perl and Python 3, and a generic question about Unicode property matching in Python. There are many Unicode properties to support, and implementations are usually partial (except in Perl, apparently). Spaces are a pervasive class, so they are likely to be implemented most of the time...

Note that I tried several scenarios before settling on the test script above, which seems readable enough. Earlier attempts include using \s with regex, removing re.UNICODE, putting the double-width space itself in the regex, etc.
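One variation that is not in the list of attempts above is to keep both the pattern and the text as unicode objects. In Python 2.7 a plain '...' literal is a byte string even with the coding declaration, so the double-width space reaches the regex engine as three UTF-8 bytes and not as one character. The following is a minimal sketch under that assumption, not a confirmed fix for the script above. count_spaces, blank_bytes_re and the u'' versions of the test cases are illustrative, and only the standard re module is used.

```python
# -*- coding: utf-8 -*-
import re

IDEOGRAPHIC_SPACE = u'\u3000'  # the double-width space used in Japanese text

# Unicode pattern: with re.UNICODE, \s follows the Unicode whitespace definition.
blank_re = re.compile(u'\\s', re.UNICODE)

# Bytes pattern, for comparison: it only ever sees individual bytes.
blank_bytes_re = re.compile(br'\s')


def count_spaces(text):
    """Return how many whitespace characters blank_re finds in text."""
    return len(blank_re.findall(text))


# Same cases as in the test script, written as unicode literals.
cases = {
    u'NoSpaceHere': 0,
    u'Space Here': 1,
    u'何かスペースがない': 0,
    u'今回 はね，スペースがある': 1,
    u'今回' + IDEOGRAPHIC_SPACE + u'もね，スペースあり': 1,
}
failures = [case for case, expected in cases.items()
            if count_spaces(case) != expected]
print(failures)  # an empty list means every case matched

# The same character as UTF-8 bytes: three bytes, none of them whitespace.
encoded = IDEOGRAPHIC_SPACE.encode('utf-8')
print(len(encoded), blank_bytes_re.findall(encoded))
```

The u'' prefix is accepted by Python 2.7 and by Python 3.3 and later, so the sketch can be run unchanged on both to compare results.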
