How to pad a character with spaces when it falls within a Unicode range?

Question

The goal is to pad a character with a space on each side when it falls within a given Unicode range, e.g.

[in]:

subset = [chr(ordinal) for ordinal in range(ord(u'\u31c0'), ord(u'\u31ef') + 1)]  # +1 so that \u31ef itself is included

text = '这是个小㇈㇋伙子'

[out]:

output_text = '这是个小 ㇈ ㇋ 伙子'

I can do it like this:

def issubset(uchar):
    # True if the character lies in U+31C0..U+31EF (inclusive)
    return u'\u31c0' <= uchar <= u'\u31ef'

def pad_space_ifsubset(text):
    output = ""
    for ch in text:
        if issubset(ch):
            # Pad matching characters with one space on each side
            output += " " + ch + " "
        else:
            output += ch
    return output

text = '这是个小㇈㇋伙子'

pad_space_ifsubset(text)

But is there a simpler way to do this? Perhaps with regex?

randomir · Accepted Answer

You can use re.sub with a character-class range over the code points of interest and a group backreference in the replacement string (\g<0> substitutes the entire match, which here is a single character from the range):

import re

def pad_space_ifsubset(text):
    return re.sub(u'[\u31c0-\u31ef]', r' \g<0> ', text)

For example:

>>> text = u'这是个小㇈㇋伙子'
>>> print(pad_space_ifsubset(text))
这是个小 ㇈  ㇋ 伙子
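
Note that each match is padded independently, so two adjacent characters from the range end up with two spaces between them, whereas the expected output in the question has only one. If single spaces are wanted, one possible approach (a minimal sketch assuming Python 3; the helper name pad_space_runs is hypothetical and not part of the original answer) is to match whole runs of characters from the range and join each run with single spaces:

```python
import re

# Hypothetical helper, not from the original answer: treats each run of
# characters in U+31C0..U+31EF as a unit, so neighbours share one space.
def pad_space_runs(text):
    return re.sub(
        '[\u31c0-\u31ef]+',
        # Join the characters of the run with single spaces, then pad the run.
        lambda m: ' ' + ' '.join(m.group(0)) + ' ',
        text,
    )

print(pad_space_runs('这是个小㇈㇋伙子'))  # 这是个小 ㇈ ㇋ 伙子
```

As with the version above, a run at the very start or end of the string still gets a leading or trailing space; call .strip() on the result if that is unwanted.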
