svenwltr

Reputation: 18462

Problem with unescaping unicode strings

I have a problem unescaping a unicode string. I tried the following, but it fails when the string contains non-ASCII characters.

>>> s = ur"\'test\'"
>>> s.decode("string_escape")
"'test'"
>>> s = ur"\'test \u2014\'"
>>> s.decode("string_escape")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 7: ordinal not in range(128)

Is there a better way to remove the backslashes?

Btw: I need this, because xmlrpclib.ServerProxy escapes the responses.

Edit: Here's an example for my xmlrpc request:

>>> import xmlrpclib
>>> server = xmlrpclib.ServerProxy("http://ws.audioscrobbler.com/2.0/")
>>> xml_data = server.tag.search({'api_key':'...','tag':'80s'})
>>> print xml_data
<?xml version=\"1.0\" encoding=\"utf-8\"?>
<lfm status=\"ok\">
<results for=\"80s\" xmlns:opensearch=\"http://a9.com/-/spec/opensearch/1.1/\">
<opensearch:Query role=\"request\" searchTerms=\"80s\" startPage=\"1\" />
...

I think the escapes come from the XML-RPC server.

Upvotes: 0

Views: 3273

Answers (2)

unutbu

Reputation: 879869

Interestingly, the error you posted does not seem to occur using Python 2.6.4:

In [110]: s = ur"\'test\'"

In [111]: s.decode("string_escape")
Out[111]: "'test'"

In [112]: s = ur"\'test \u2014\'"

In [113]: s.decode("string_escape")
Out[113]: "'test \xe2\x80\x94'"

In [114]: print(s.decode("string_escape"))
'test —'

Upvotes: 0

Rosh Oxymoron

Reputation: 21055

First, there are two codecs, "string_escape" and "unicode_escape", and neither can decode the string that you have given. The first reads a byte string that was escaped as a byte string and decodes it to a byte string. The second reads a unicode string that was escaped and stored in a byte string, so it can't take a unicode object as input, at least not one that contains non-ASCII characters.
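To illustrate that last point: in Python 2, calling .decode() on a unicode object with one of these codecs first encodes the object to a byte string using the default encoding, which is normally ASCII. That implicit step is what raises the UnicodeEncodeError shown in the question. A minimal sketch of it (the helper name is my own, purely for illustration):

```python
def survives_implicit_ascii_encode(text):
    """Return True if `text` can be encoded as ASCII.

    This mimics the implicit encode that Python 2 performs before
    handing a unicode object to a bytes-to-bytes codec.
    """
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


# Only backslashes, quotes and letters: pure ASCII, so the encode succeeds.
print(survives_implicit_ascii_encode(u"\\'test\\'"))

# Contains U+2014, which has no ASCII representation, so the encode fails.
print(survives_implicit_ascii_encode(u"\\'test \u2014\\'"))
```

This would also explain why the other answer does not see the error, if the default encoding there is not ASCII, though that is a guess on my part.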

I believe that the raw string you've given here is wrong, and you actually want s.decode('unicode_escape') for the real strings coming from your source.
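As a rough sketch of that case, assume the data arrives as a byte string in which the dash is still the six literal characters \u2014 (this input is my assumption, not something taken from your server):

```python
# Hypothetical input: backslash-escaped, pure-ASCII bytes.
raw = b"\\'test \\u2014\\'"

# unicode_escape turns \' into ' and \u2014 into the real character.
text = raw.decode("unicode_escape")

print(repr(text))
```

The same call should work on Python 2.6+ and Python 3, since the input is explicitly a byte string.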

If I'm incorrect, the best you can do is to escape any unescaped single quotes manually with re, wrap the result in single quotes and pass it to ast.literal_eval.

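import ast
import re

def substitute(match):
    # An odd number of backslashes means the quote is already escaped.
    if len(match.group(1)) % 2 == 1:
        return match.group()
    else:
        # Even count (including zero): add one backslash to escape the quote.
        return ur"%s\%s" % (match.group(1), match.group(2))

# \\* (rather than \\+) also catches quotes with no preceding backslash.
ast.literal_eval("'%s'" % re.sub(ur"(\\*)(')", substitute, s))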

A third option is that the string needs to be passed to ast.literal_eval without any additional work on your part. Which of the three applies depends on exactly what string you have.
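For that third option, a small sketch, assuming the string you receive is already a complete Python literal with its own surrounding quotes (the sample literal and its u prefix are invented for illustration):

```python
import ast

# Hypothetical input: the text of a Python unicode literal, quotes included.
literal = "u'test \\u2014 \\'quoted\\''"

# literal_eval parses the literal safely, without executing arbitrary code.
value = ast.literal_eval(literal)

print(repr(value))
```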

Another suspicion I have is that it might be a JSON object. You should give an example of the string that you're getting, and say where and how you're getting it.
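If the JSON suspicion is right, something like the following might be all you need. This is only a sketch: the payload is made up, loosely modelled on the output in the question, and it assumes the response body is a JSON string including its surrounding double quotes.

```python
import json

# Hypothetical payload: a JSON-encoded string with \" and \u escapes.
payload = '"<lfm status=\\"ok\\">\\u2014</lfm>"'

# json.loads undoes both the quote escaping and the \uXXXX escapes.
text = json.loads(payload)

print(repr(text))
```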

Upvotes: 2
