Reputation: 122142
Given a string that mixes non-ASCII Unicode characters with ASCII characters, e.g.:
它看灵魂塑Nike造得和学问同等重要。
The goal is to pad the ascii substrings with spaces, i.e.:
它看灵魂塑 Nike 造得和学问同等重要。
I've tried the regex ([^[:ascii:]]), which looks fine when matching the substrings, e.g. https://regex101.com/r/FVHhU1/1
But in code, substituting with ' \1 ' does not produce the desired output.
>>> import re
>>> patt = re.compile('([^[:ascii:]])')
>>> s = u'它看灵魂塑Nike造得和学问同等重要。'
>>> print (patt.sub(' \1 ', s))
它看灵魂塑Nike造得和学问同等重要。
How can I pad the ASCII substrings with spaces in a mixed Unicode-ASCII string?
Upvotes: 1
Views: 823
Reputation: 477180
The pattern should be:
([\x00-\x7f]+)
So you can use:
patt = re.compile(r'([\x00-\x7f]+)')
patt.sub(r' \1 ', s)
This generates:
>>> print(patt.sub(r' \1 ',s))
它看灵魂塑 Nike 造得和学问同等重要。
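ASCII is defined as the range of characters with hex codes between 00 and 7f. So we define that range as [\x00-\x7f], use + to match one or more such characters, and replace the matched group with r' \1 ' to add a space on each side. The r prefix matters: without it, Python reads '\1' as the control character \x01 rather than as a backreference to group 1.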
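As a minimal sketch of the approach described above, the pattern and the substitution can be wrapped in a small helper. The names ASCII_RUN and pad_ascii are made up for illustration and do not come from the original posts; the last line shows the raw-string pitfall from the question.

```python
import re

# One or more consecutive ASCII characters (code points 0x00 through 0x7f).
# ASCII_RUN and pad_ascii are hypothetical names used only for this sketch.
ASCII_RUN = re.compile(r'([\x00-\x7f]+)')

def pad_ascii(text):
    """Surround every run of ASCII characters with one space on each side."""
    # r' \1 ' is a raw string, so \1 reaches re as a backreference to group 1.
    return ASCII_RUN.sub(r' \1 ', text)

s = '它看灵魂塑Nike造得和学问同等重要。'
print(pad_ascii(s))

# Without the r prefix, '\1' is an octal escape for the character \x01,
# which is why the substitution in the question inserted no backreference.
print(' \1 ' == ' \x01 ')
```

Note that the ASCII range also covers spaces and punctuation, so an ASCII run at the very start or end of the string picks up a leading or trailing space; calling .strip() on the result removes it if that is unwanted.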
Upvotes: 2