gdiazc

Reputation: 2148

How to compare these two strings in Python?

I have a file with the following two strings:

25_%D1%80%D0%B0%D1%88%D3%99%D0%B0%D1%80%D0%B0
25_\xD1\x80\xD0\xB0\xD1\x88\xD3\x99\xD0\xB0\xD1\x80\xD0\xB0

They both represent the same URL path, and therefore should be equal. I would like to apply the same "cleaning function" to both of them, obtaining the same string.

After reading these strings from the file I have:

>> s0
'25_%D1%80%D0%B0%D1%88%D3%99%D0%B0%D1%80%D0%B0'
>> s1
'25_\\xD1\\x80\\xD0\\xB0\\xD1\\x88\\xD3\\x99\\xD0\\xB0\\xD1\\x80\\xD0\\xB0'

(note the escaped backslashes in s1). If I unquote s0 I get the following:

>> import urllib
>> t0 = urllib.unquote(s0)
'25_\xd1\x80\xd0\xb0\xd1\x88\xd3\x99\xd0\xb0\xd1\x80\xd0\xb0'
>> print t0
25_рашәара

which is good. However, the only thing I know to do on s1 is the following:

>> t1 = s1.decode("unicode_escape")
u'25_\xd1\x80\xd0\xb0\xd1\x88\xd3\x99\xd0\xb0\xd1\x80\xd0\xb0'
>> print t1
25_ÑаÑÓаÑ

which looks broken. My question is: what clean(s) function could be written to normalize these two strings, so that they are either both <type 'str'> or both <type 'unicode'>, and both print and compare equal?

Upvotes: 0

Views: 108

Answers (1)

georg

Reputation: 214949

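Both strings carry the same UTF-8 bytes, only escaped differently. Turn each one back into raw bytes first (unquote for the percent form, string_escape for the backslash form), then decode those bytes as UTF-8. In Python 2: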

>>> s0 = '25_%D1%80%D0%B0%D1%88%D3%99%D0%B0%D1%80%D0%B0'
>>> s1 = '25_\\xD1\\x80\\xD0\\xB0\\xD1\\x88\\xD3\\x99\\xD0\\xB0\\xD1\\x80\\xD0\\xB0'
>>> import urllib
>>> t0 = urllib.unquote(s0).decode('utf8')
>>> t1 = s1.decode('string_escape').decode('utf8')
>>> print t0
25_рашәара
>>> print t1
25_рашәара
>>> t0 == t1
True
>>> 
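
The unicode_escape attempt in the question looks broken because it maps each \xHH to a single Unicode code point (U+00D1, U+0080, ...) instead of to a byte of a UTF-8 sequence. For a single clean(s) function, here is a minimal sketch, assuming Python 3 and assuming the only escape forms in the file are %HH and \xHH. It is a different route from the session above: it rewrites \xHH as %HH and lets unquote do the UTF-8 decoding. The name clean comes from the question; the regular expression is my own assumption about the input format.

```python
import re
from urllib.parse import unquote

# Matches a literal backslash, an "x", and exactly two hex digits, e.g. \xD1
_BACKSLASH_HEX = re.compile(r"\\x([0-9A-Fa-f]{2})")


def clean(s):
    """Return the Unicode text for a path written with %HH or \\xHH escapes."""
    # Rewrite \xHH as %HH so both input forms share one decoding path.
    percent_form = _BACKSLASH_HEX.sub(r"%\1", s)
    # unquote() decodes the percent-escaped bytes as UTF-8 by default.
    return unquote(percent_form)


s0 = "25_%D1%80%D0%B0%D1%88%D3%99%D0%B0%D1%80%D0%B0"
s1 = "25_\\xD1\\x80\\xD0\\xB0\\xD1\\x88\\xD3\\x99\\xD0\\xB0\\xD1\\x80\\xD0\\xB0"
print(clean(s0) == clean(s1))
```

Note that unquote replaces invalid UTF-8 sequences with U+FFFD by default rather than raising, so malformed input will not be reported as an error unless you pass errors="strict".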

Upvotes: 2
