Unicode Substitutions using Regex , Python

Question

I have a string as follows:

str1 = "heylisten\uff08there is something\uff09to say \uffa9"

I need to replace the unicode values detected by my regex expression with spaces on either sides.

Desired output string:

out = "heylisten \uff08 there is something \uff09 to say  \uffa9 "

I have used an re.findall to get all the matches and then replace them. It looks like:

p1 = re.findall(r'\uff[0-9a-e][0-9]', str1, flags = re.U)  
out = str1
for item in p1:
    print item
    print out
    out= re.sub(item, r" " + item + r" ", out)

And this outputs:

'heylisten\ uff08 there is something\ uff09 to say \ uffa9 '

What is wrong with the above that it prints an extra "" and also separates it from uff? I even tried with re.search but it seems to only separate \uff08. Is there a better way?

Ignacio Vazquez-Abrams · Accepted Answer

I have a string as follows:
str1 = "heylisten\uff08there is something\uff09to say \uffa9"
I need to replace the unicode values ...

You don't have any unicode values. You have a bytestring.

str1 = u"heylisten\uff08there is something\uff09to say \uffa9"
 ...
p1 = re.sub(ur'([\uff00-\uffe9])', r' \1 ', str1)

Unicode Substitutions using Regex , Python

Answers (2)

Related Questions