Problem with unescaping unicode strings

Question

I have a problem unescaping a unicode string. I tried the following, but it doesn't work when the string contains non-ASCII characters.

>>> s = ur"\'test\'"
>>> s.decode("string_escape")
"'test'"
>>> s = ur"\'test \u2014\'"
>>> s.decode("string_escape")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 7: ordinal not in range(128)

Is there a better way to remove the backslashes?

By the way, I need this because xmlrpclib.ServerProxy escapes the responses.

Edit: Here's an example of my XML-RPC request:

>>> import xmlrpclib
>>> server = xmlrpclib.ServerProxy("http://ws.audioscrobbler.com/2.0/")
>>> xml_data = server.tag.search({'api_key':'...','tag':'80s'})
>>> print xml_data




...

I think the escapes come from the XML-RPC server.

Rosh Oxymoron · Accepted Answer

First, there are two codecs, "string_escape" and "unicode_escape", and neither can decode the string you have given. The first reads a bytestring that contains escape sequences and returns a bytestring. The second reads a bytestring that contains an escaped unicode string and returns a unicode object, so it can't take a unicode object as input, at least not one that contains non-ASCII characters. (Calling decode on a unicode object makes Python 2 encode it to ASCII first, which is what raises the UnicodeEncodeError.)

I suspect that the raw string literal you've shown here doesn't match your real data, and that you actually want s.decode('unicode_escape') for the real strings coming from your source.
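
As a rough illustration of that suggestion, the sketch below assumes the data arrives as a bytestring that still contains the literal backslash sequences. The sample value raw is made up for the example and is not taken from the actual server response. The call should behave the same on Python 2 and Python 3.

```python
# Hypothetical sample: the escaped text as it might arrive, as a bytestring.
# The backslashes are doubled so that they are literal characters in the data.
raw = b"\\'test \\u2014\\'"

# unicode_escape turns \' into ' and \u2014 into the character U+2014.
decoded = raw.decode("unicode_escape")

print(repr(decoded))
```

Keep in mind that unicode_escape interprets any unescaped bytes as Latin-1, so this is only safe when every non-ASCII character in the input is escaped.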

If I'm incorrect, the best you can do is to manually escape any unescaped single quotes with re, wrap the result in single quotes, and pass it to ast.literal_eval.

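import ast
import re

def substitute(match):
    if len(match.group(1)) % 2 == 1:
        # An odd number of backslashes: the quote is already escaped.
        return match.group()
    else:
        # An even number (or none): add a backslash to escape the quote.
        return ur"%s\%s" % (match.group(1), match.group(2))

# Escape bare quotes, wrap the result in quotes, and evaluate it as a literal.
ast.literal_eval("'%s'" % re.sub(ur"(\\*)(')", substitute, s))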

A third option is that the string needs to be passed to ast.literal_eval without any additional work on your part. Which of the three applies depends on exactly what string you have.
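
For that third case, here is a minimal sketch, assuming the value you receive is already a complete Python literal with its own surrounding quotes. The sample literal is hypothetical.

```python
import ast

# Hypothetical sample: text that is itself a quoted Python unicode literal.
literal = "u'\\'test \\u2014\\''"

# literal_eval parses the literal without executing arbitrary code.
value = ast.literal_eval(literal)

print(repr(value))
```

Unlike eval, ast.literal_eval only accepts literals such as strings, numbers, and containers, which makes it a reasonable choice for data from a remote server.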

Another suspicion I have is that it might be a JSON object. Please give an example of the string you're getting, along with where and how you get it.
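
If it does turn out to be JSON, the standard json module handles \uXXXX escapes directly. A small sketch, with a made-up payload standing in for the real response:

```python
import json

# Hypothetical payload: a JSON document with an escaped non-ASCII character.
payload = '{"name": "test \\u2014"}'

# json.loads decodes the \u2014 escape into the actual character.
data = json.loads(payload)

print(repr(data["name"]))
```

JSON has no \' escape, so json.loads rejects a string that contains one. That is one way to tell the two formats apart.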
