Jonathan Vanasco
Jonathan Vanasco

Reputation: 15680

python2 to python3 migration issue with unicode and bytes

I updated a Python2 package to support Python3, and am stuck on handling a single test case that fails under Python3 due to some encoding issues. The package generally deals with URL standardization and does some custom transformations before or after offloading to a few libraries on PyPi.

In Python2 I might have two strings which are both encodings of the same URL as such:

url_a = u'http://➡.ws/♥'
url_b =  'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'

for which the following are true:

url_a.encode('utf-8') == url_b
>>> True
type(url_a.encode('utf-8')) == str
>>> True

After a bunch of miscellaneous routes, they are both standardized to a punycode

url_result = 'http://xn--hgi.ws/%E2%99%A5'

Under Python3 I am hitting a wall because url_a.encode('utf-8') returns a bytestring, which is the required declaration when defining the variable in this format too.

url_a.encode('utf-8')
>>> b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'
url_a.encode('utf-8') == url_b
>>> False
type(url_a.encode('utf-8')) == str
>>> True
type(url_a.encode('utf-8')) == bytes
>>> True

I can not figure out a way to perform operations on url_b to have it encoded/decoded as I require it to be.

Which I could just define my test case with a bytestring declaration and everything will pass in both environments...

url_a = u'http://➡.ws/♥'
url_b = b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'

there is still a possibility of something breaking in production because of data in messaging queues or databases that has not been processed yet.

essentially, in Python3, I need to detect that a short string such as

url_b = 'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'

should have been declared as a bytestring

url_b = b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'

and convert it properly, because it is interpreted as

url_b
>>> 'http://â\x9e¡.ws/â\x99¥'

edit: The closest I've come is url_b.decode('unicode-escape') which generates b'http://\\xe2\\x9e\\xa1.ws/\\xe2\\x99\\xa5'

Upvotes: 0

Views: 1301

Answers (2)

jez
jez

Reputation: 15329

You want .encode(), not .decode(), and 'raw_unicode_escape':

#!/usr/bin/env python
# -*- coding: utf-8 -*-

url_a = u'http://➡.ws/♥'
url_b =  'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'

encoded_a = url_a.encode('utf-8')
try:
    # Python 3
    encoded_b = url_b.encode('raw_unicode_escape')
except UnicodeDecodeError:
    # Python 2
    encoded_b = url_b

print(repr(encoded_a))
print(repr(encoded_b))

# Output is as follows (without the leading 'b' in Python 2):
#   b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'
#   b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'

Upvotes: 2

Olvin Roght
Olvin Roght

Reputation: 7812

Code:

url_b = b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'
print(url_b.decode("utf-8"))

Output:

http://➡.ws/♥

Upvotes: 0

Related Questions