Reputation: 283
I have a simple script that scrapes Google for a link and then scrapes that link. However, some links contain dashes in them, and for some reason it comes out like this %25E2%2580%2593
in my script (in the url). So it would now look like this: http://myaddress.com/search?q=The_%25E2%2580%2593_World
when I want it to look like this http://myaddress.com/search?q=The_–_World
. How can I go about doing this? Should I be using UTF-8 encoding/decoding?
Edit:
I tried double unquoting (with reference to this link) but to no avail. Instead I get a result that looks like this: http://myaddress.com/search?q=The_–_World
.
Upvotes: 3
Views: 406
Reputation: 3192
The URL appears to be double URL encoded.
To decode to its raw form use the urllib libraries' parse.unquote
function to perform double URL decoding:
import urllib.parse
url = 'http://myaddress.com/search?q=The_%25E2%2580%2593_World'
urllib.parse.unquote(urllib.parse.unquote(url))
which decodes to the desired 'http://myaddress.com/search?q=The_–_World' URL.
EDIT:
As you have explained that you are using Python 2.7, the equivalent decode function would be unquote(url)
(refer to the documentation here).
import urllib
url = 'http://myaddress.com/search?q=The_%25E2%2580%2593_World'
print(urllib.unquote(urllib.unquote(url))).decode('utf-8')
Output:
http://myaddress.com/search?q=The_–_World
Upvotes: 3