Reputation: 635
I have the following URL, stored as a unicode string:
url_input = u'https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-\xa3250pw-all-bills-included-/1174092955'
I need to scrape this webpage, and to do so I need the following url_output (the unicode escape is not understood):
url_output=https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-£250pw-all-bills-included-/1174092955
When I print url_input, I get url_output:
print(url_input)
https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-£250pw-all-bills-included-/1174092955
However, I cannot find a way to transform url_input into url_output. According to forums, the print function uses ASCII on Python 2.7, but ASCII is not supposed to handle \xa3, and url_input.encode('ASCII') does not work.
Does anyone know how I can solve this problem? Thanks in advance!
Upvotes: 2
Views: 2396
Reputation: 148870
After some tests, I can confirm that the server accepts the URL in different formats:
raw UTF-8 encoded URL:
url_output = url_input.encode('utf8')
percent-encoded Latin-1 URL:
url_output = urllib.quote_plus(url_input.encode('latin1'), '/:')
percent-encoded UTF-8 URL:
url_output = urllib.quote_plus(url_input.encode('utf8'), '/:')
Since the raw Latin-1 form is not accepted and leads to an incorrect URL error, and since passing non-ASCII characters in a URL may not be safe, my advice is to use the third option. It gives:
print url_output
https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-%C2%A3250pw-all-bills-included-/1174092955
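The three variants can be compared side by side with a minimal sketch like the one below. It is written to run on both Python 2.7 and Python 3; the try/except import and the names raw_utf8, quoted_latin1 and quoted_utf8 are my own additions for illustration, not something the server or the question requires.

```python
try:
    from urllib import quote_plus        # Python 2.7
except ImportError:
    from urllib.parse import quote_plus  # Python 3

url_input = u'https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-\xa3250pw-all-bills-included-/1174092955'

# 1. Raw UTF-8 bytes: the pound sign becomes the two bytes C2 A3.
raw_utf8 = url_input.encode('utf8')

# 2. Percent-encoded Latin-1: the pound sign is the single byte A3.
#    '/:' is the set of characters left unquoted.
quoted_latin1 = quote_plus(url_input.encode('latin1'), '/:')

# 3. Percent-encoded UTF-8: the pound sign becomes %C2%A3.
quoted_utf8 = quote_plus(url_input.encode('utf8'), '/:')

print(quoted_latin1)
print(quoted_utf8)
```

Only the third result is pure ASCII and unambiguous about the encoding of the pound sign, which is why I would prefer it.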
Upvotes: 1
Reputation: 399
When you print url_input you get the desired url_output only because your terminal understands UTF-8 and can represent \xa3 correctly.
You can encode the string in ASCII with str.encode, but you have to either replace (with a ?) or ignore the characters that are not ASCII:
url_output = url_input.encode("ascii", "replace")
print(url_output)
prints:
https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-?250pw-all-bills-included-/1174092955
and
url_output = url_input.encode("ascii", "ignore")
print(url_output)
prints:
https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-250pw-all-bills-included-/1174092955
You cannot obtain an ASCII output string containing £, because £ is not an ASCII character: its code point (163) is greater than 127.
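To make this concrete, here is a small sketch (it should run on Python 2.7 and Python 3 alike) that checks the code point of £ and exercises the three error-handling modes of encode. The shortened string and the variable names are purely illustrative.

```python
pound = u'\xa3'

# ASCII covers code points 0-127 only; the pound sign is 163.
print(ord(pound))

text = u'flat-\xa3250pw'

# The default "strict" mode refuses characters outside ASCII.
try:
    text.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# "replace" substitutes a '?', "ignore" silently drops the character.
replaced = text.encode('ascii', 'replace')
ignored = text.encode('ascii', 'ignore')

print(strict_failed)
print(replaced)
print(ignored)
```

Both "replace" and "ignore" lose information, so the resulting URL no longer points at the same page.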
Upvotes: 2