Jb_Eyd

Reputation: 635

Decoding UTF-8 to URL with Python

I have the following url encoded in utf-8.

url_input = u'https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-\xa3250pw-all-bills-included-/1174092955'

I need to scrape this web page, and to do so I need the following url_output (the unicode form is not read).

url_output=https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-£250pw-all-bills-included-/1174092955

When I print url_input, I get url_output:

print(url_input)
https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-£250pw-all-bills-included-/1174092955

However, I cannot find a way to transform url_input into url_output. According to forums, the print function uses ASCII decoding on Python 2.7, but ASCII is not supposed to be able to read \xa3, and url_input.encode('ASCII') does not work.

Does anyone know how I can solve this problem? Thanks in advance!

Upvotes: 2

Views: 2396

Answers (2)

Serge Ballesta

Reputation: 148870

After some tests, I can confirm that the server accepts the URL in different formats:

  • raw utf8 encoded URL:

    url_output = url_input.encode('utf8')
    
  • %encoded latin1 URL

    import urllib
    url_output = urllib.quote_plus(url_input.encode('latin1'), '/:')
    
  • %encoded utf8 URL

    url_output = urllib.quote_plus(url_input.encode('utf8'), '/:')
    

As raw latin1 is not accepted and leads to an incorrect URL error, and as passing non-ASCII characters in a URL may not be safe, my advice is to use the third way. It gives:

    print url_output

    https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-%C2%A3250pw-all-bills-included-/1174092955
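The snippets above are Python 2. As a rough sketch of the same percent-encoding in Python 3 (an assumption, since the question targets Python 2.7), urllib.parse.quote can do the encoding and the escaping in one call. The names url_utf8 and url_latin1 are only illustrative:

```python
from urllib.parse import quote

# In Python 3 every str is unicode, so no u'' prefix is needed.
url_input = 'https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-\xa3250pw-all-bills-included-/1174092955'

# Keep '/' and ':' unescaped so the scheme and path separators survive.
# quote() encodes the text as UTF-8 by default before escaping it.
url_utf8 = quote(url_input, safe='/:')

# The latin1 variant: the pound sign becomes a single escaped byte.
url_latin1 = quote(url_input, safe='/:', encoding='latin-1')

print(url_utf8)
print(url_latin1)
```

Here the pound sign comes out as %C2%A3 in the UTF-8 variant and as %A3 in the latin1 variant.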

Upvotes: 1

Urban

Reputation: 399

When you print url_input, you get the desired url_output only because your terminal understands UTF-8 and can represent \xa3 correctly.

You can encode the string to ASCII with the encode method, but you have to replace (with a ?) or ignore the characters that are not ASCII:

url_output = url_input.encode("ascii", "replace")
print(url_output)

prints:

https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-?250pw-all-bills-included-/1174092955

and

url_output = url_input.encode("ascii", "ignore")
print(url_output)

prints:

https://www.gumtree.com//p/uk-holiday-rentals/1bedroon-flat-250pw-all-bills-included-/1174092955

You cannot obtain an ASCII output string containing £, because £ is not an ASCII character: its code point (163, or 0xA3) is greater than 127.
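To make the limit concrete, here is a small sketch written for Python 3 (an assumption, as the question uses Python 2.7, where encode returns a byte str rather than bytes). It checks the code point of £ and shows what each error handler does with it. The variable names are illustrative only:

```python
pound = '\xa3'  # the £ character

# ASCII only covers code points 0..127; £ is 163.
code_point = ord(pound)

# Strict encoding (the default) refuses characters outside ASCII.
try:
    pound.encode('ascii')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

replaced = pound.encode('ascii', 'replace')  # substitutes a question mark
ignored = pound.encode('ascii', 'ignore')    # drops the character
utf8_bytes = pound.encode('utf-8')           # two bytes: 0xC2 0xA3

print(code_point, strict_ok, replaced, ignored, utf8_bytes)
```

The two UTF-8 bytes are what appear as %C2%A3 once the URL is percent-encoded.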

Upvotes: 2
