Lachlan Mather

Reputation: 283

Python trouble with dashes in web scraping

I have a simple script that scrapes Google for a link and then scrapes that link. However, some links contain an en dash (–), and in my script it shows up in the URL as %25E2%2580%2593. So the URL looks like this: http://myaddress.com/search?q=The_%25E2%2580%2593_World when I want it to look like this: http://myaddress.com/search?q=The_–_World. How can I fix this? Should I be using UTF-8 encoding/decoding?

Edit:
I tried double unquoting (with reference to this link) but to no avail. Instead I get a result that looks like this: http://myaddress.com/search?q=The_–_World.

Upvotes: 3

Views: 406

Answers (1)

smoggers

Reputation: 3192

The URL appears to be double URL encoded.
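To see how such a string can arise, here is a minimal Python 3 sketch that percent-encodes an en dash twice. Where the second encoding happens in your scraping pipeline is not known from the question; the variable names are illustrative only.

```python
from urllib.parse import quote

dash = '\u2013'      # en dash, UTF-8 bytes E2 80 93

once = quote(dash)   # each UTF-8 byte becomes %XX
twice = quote(once)  # each '%' is itself encoded as %25

print(once)   # %E2%80%93
print(twice)  # %25E2%2580%2593
```

The second value matches the sequence in the question, which is why two rounds of decoding are needed to get back to the dash.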

To decode it back to its raw form in Python 3, apply urllib.parse.unquote twice:

import urllib.parse

url = 'http://myaddress.com/search?q=The_%25E2%2580%2593_World'
# First pass turns %25 into %, second pass turns %E2%80%93 into the dash
print(urllib.parse.unquote(urllib.parse.unquote(url)))

which decodes to the desired 'http://myaddress.com/search?q=The_–_World' URL.
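If you do not know in advance how many times a URL was encoded, one option is to keep decoding until the string stops changing. This is only a sketch for Python 3: fully_unquote and max_rounds are hypothetical names, not part of urllib, and the approach assumes the final URL contains no literal percent escapes that should be preserved.

```python
from urllib.parse import unquote

def fully_unquote(url, max_rounds=5):
    """Percent-decode repeatedly until the string no longer changes."""
    for _ in range(max_rounds):
        decoded = unquote(url)
        if decoded == url:
            # Nothing left to decode
            break
        url = decoded
    return url

result = fully_unquote('http://myaddress.com/search?q=The_%25E2%2580%2593_World')
print(result)
```

Note that this would also decode a deliberately escaped sequence such as %2520 all the way down to a space, so use it only when over-decoding is acceptable.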

EDIT:

Since you are using Python 2.7, the equivalent function is urllib.unquote (refer to the documentation here). It returns a byte string, so decode the result as UTF-8 before printing:

import urllib

url = 'http://myaddress.com/search?q=The_%25E2%2580%2593_World'
print(urllib.unquote(urllib.unquote(url)).decode('utf-8'))

Output:

http://myaddress.com/search?q=The_–_World

Upvotes: 3
