Reputation: 15680
I updated a Python2 package to support Python3, and am stuck on a single test case that fails under Python3 due to encoding issues. The package generally deals with URL standardization and does some custom transformations before or after offloading to a few libraries on PyPI.
In Python2 I might have two strings which are both encodings of the same URL as such:
url_a = u'http://➡.ws/♥'
url_b = 'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'
for which the following are true:
url_a.encode('utf-8') == url_b
>>> True
type(url_a.encode('utf-8')) == str
>>> True
After passing through a bunch of miscellaneous code paths, they are both standardized to the same punycode URL:
url_result = 'http://xn--hgi.ws/%E2%99%A5'
Under Python3 I am hitting a wall because url_a.encode('utf-8') returns a bytestring, which is the required declaration when defining the variable in this format too.
url_a.encode('utf-8')
>>> b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'
url_a.encode('utf-8') == url_b
>>> False
type(url_a.encode('utf-8')) == str
>>> False
type(url_a.encode('utf-8')) == bytes
>>> True
I cannot figure out a way to perform operations on url_b to have it encoded/decoded as I require it to be.
While I could just define my test case with a bytestring declaration and everything will pass in both environments...
url_a = u'http://➡.ws/♥'
url_b = b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'
there is still a possibility of something breaking in production because of data in messaging queues or databases that has not been processed yet.
Essentially, in Python3, I need to detect that a short string such as
url_b = 'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'
should have been declared as a bytestring
url_b = b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'
and convert it properly, because it is interpreted as
url_b
>>> 'http://â\x9e¡.ws/â\x99¥'
Edit: The closest I've come is url_b.decode('unicode-escape'), which generates b'http://\\xe2\\x9e\\xa1.ws/\\xe2\\x99\\xa5'
Upvotes: 0
Views: 1301
Reputation: 15329
You want .encode(), not .decode(), and 'raw_unicode_escape':
#!/usr/bin/env python
# -*- coding: utf-8 -*-
url_a = u'http://➡.ws/♥'
url_b = 'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'
encoded_a = url_a.encode('utf-8')
try:
    # Python 3
    encoded_b = url_b.encode('raw_unicode_escape')
except UnicodeDecodeError:
    # Python 2
    encoded_b = url_b
print(repr(encoded_a))
print(repr(encoded_b))
# Output is as follows (without the leading 'b' in Python 2):
# b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'
# b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'
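For data arriving from queues or databases, where you cannot know in advance whether a value is bytes, real text, or text that is really UTF-8 bytes read one byte per character, a small helper along these lines might work. This is only a sketch for Python 3: the name to_utf8_bytes is made up for illustration, and it uses 'latin-1' rather than 'raw_unicode_escape' so that genuine non-Latin-1 text raises instead of being escaped. The detection is a heuristic, so a legitimate string that happens to look like mis-decoded UTF-8 (for example 'Ã©') would be treated as bytes.

```python
def to_utf8_bytes(value):
    """Return UTF-8 bytes for bytes, real text, or text that actually
    holds UTF-8 bytes read one byte per character (hypothetical helper)."""
    if isinstance(value, bytes):
        # Already bytes: nothing to do.
        return value
    try:
        # Succeeds only if every code point is below 256.
        raw = value.encode('latin-1')
        # Succeeds only if those bytes form valid UTF-8.
        raw.decode('utf-8')
        return raw
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Genuine text: encode it normally.
        return value.encode('utf-8')


url_a = u'http://\u27a1.ws/\u2665'
url_b = 'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'
print(to_utf8_bytes(url_a) == to_utf8_bytes(url_b))
```

Both calls should yield the same bytestring, which can then go through the existing standardization path.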
Upvotes: 2
Reputation: 7812
Code:
url_b = b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'
print(url_b.decode("utf-8"))
Output:
http://➡.ws/♥
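If the value has already been stored as a str without the b prefix, as in the question, the same decode can be reached by first mapping each character back to a single byte. A rough sketch for Python 3, where the variable names are only illustrative and the string is assumed to contain no code points above 255:

```python
url_a = u'http://\u27a1.ws/\u2665'

# Text whose characters are really UTF-8 byte values (the question's url_b).
mis_declared = 'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'

# Latin-1 maps code points 0-255 to the same byte values.
as_bytes = mis_declared.encode('latin-1')

# Those bytes are valid UTF-8, so decoding recovers the intended text.
recovered = as_bytes.decode('utf-8')

print(recovered == url_a)
```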
Upvotes: 0