alvas

Reputation: 122142

Padding ASCII characters with spaces in a mixed Unicode-ASCII string

Given a string that mixes Unicode and ASCII characters, e.g.:

它看灵魂塑Nike造得和学问同等重要。

The goal is to pad the ASCII substrings with spaces, i.e.:

它看灵魂塑 Nike 造得和学问同等重要。

I've tried the ([^[:ascii:]]) regex. It seems to match the substrings correctly, e.g. https://regex101.com/r/FVHhU1/1

But in code, the substitution with ' \1 ' is not achieving the desired output.

>>> import re
>>> patt = re.compile('([^[:ascii:]])')
>>> s = u'它看灵魂塑Nike造得和学问同等重要。'
>>> print (patt.sub(' \1 ', s))
它看灵魂塑Nike造得和学问同等重要。

How can I pad the ASCII substrings with spaces in a mixed Unicode-ASCII string?

Upvotes: 1

Views: 823

Answers (1)

willeM_ Van Onsem

Reputation: 477180

The pattern should be:

([\x00-\x7f]+)

So you can use:

patt = re.compile(r'([\x00-\x7f]+)')
patt.sub(r' \1 ', s)

This generates:

>>> print(patt.sub(r' \1 ',s))
它看灵魂塑 Nike 造得和学问同等重要。

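ASCII is the set of characters with hex codes from 00 to 7f. The character class [\x00-\x7f] matches that range, + matches one or more such characters in a row, and the replacement r' \1 ' puts one space on each side of the matched group. The r prefix matters: in a plain string literal, '\1' is the control character U+0001 rather than a backreference. In addition, Python's re module does not support POSIX classes such as [:ascii:], so the original pattern never matched and the string came back unchanged.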
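As a minimal, self-contained sketch of the approach above: the helper name pad_ascii and the constant ASCII_RUN are made up for illustration and are not part of the original answer. Note that an ASCII run at the very start or end of the string also gets padded, so the result may carry a leading or trailing space.

```python
import re

# Hypothetical helper names; the pattern itself is the one from the answer.
# A raw string keeps \x00 and \x7f as regex escapes rather than Python escapes.
ASCII_RUN = re.compile(r'([\x00-\x7f]+)')

def pad_ascii(text):
    """Surround every run of ASCII characters with one space on each side."""
    # r' \1 ' is a raw string, so \1 is a backreference to group 1.
    return ASCII_RUN.sub(r' \1 ', text)

s = '它看灵魂塑Nike造得和学问同等重要。'
print(pad_ascii(s))

# Without the r prefix, '\1' is the single control character U+0001.
print(repr(' \1 '))
```

If padding at the string boundaries is not wanted, calling .strip() on the result is one option.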

Upvotes: 2
