Reputation: 1263

python urllib unquote corrupt

How can I make urllib unquote only valid %-encoded sequences?

import HTMLParser
import urllib2

html_parser = HTMLParser.HTMLParser()
url = 'Time-@#*%ed%20&amp;'
print urllib2.unquote(url)
print html_parser.unescape(url)

The result is:

Time-@#*� &amp;
Time-@#*%ed%20&

urllib unquotes '%20' to ' ', but it also wrongly unquotes '%ed' to '�'.

HTMLParser can unescape '&amp;' to '&', but it can't convert '%20' to ' '.

-------------- edit ------

I apologize for not explaining my question very well. In fact, I have many strings to process; some are URLs, some are not. The original string is Time-@#*%ed; I changed it to Time-@#*%ed%20&amp; to cover both situations. It turns out to be hard to handle both situations in a single line of code. After reading the answers, I wrote my own function:

#!/usr/bin/env python
#coding: utf8

import sys
import os
import HTMLParser
import re
import urllib

html_parser = HTMLParser.HTMLParser()
url_pattern = re.compile('^(ftp|http|https)://.{4,}', flags=re.I)
def unquote_string(url):
    if url_pattern.search(url):
        while True:
            url1 = urllib.unquote(url)
            if url1 == url: break
            url = url1
    else:
        while True:
            url1 = html_parser.unescape(url)
            if url1 == url: break
            url = url1

    return url

url = 'Time-@#*%ed%20&amp;'
print urllib.unquote(url)
print html_parser.unescape(url)
print unquote_string(url)
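
For readers on Python 3, the same idea can be sketched roughly as follows. This is only a port under assumptions: it uses html.unescape and urllib.parse.unquote in place of the Python 2 calls above, and the max_rounds cap is a hypothetical safety limit that is not in the original function.

```python
import re
from html import unescape
from urllib.parse import unquote

# Same heuristic as above: treat the string as a URL only if it has a scheme.
URL_PATTERN = re.compile(r'^(ftp|https?)://.{4,}', flags=re.I)


def unquote_string(text, max_rounds=10):
    """Repeatedly URL-unquote URLs, or HTML-unescape everything else.

    max_rounds is a hypothetical cap added for this sketch; the loop
    normally stops as soon as a round changes nothing.
    """
    decode = unquote if URL_PATTERN.search(text) else unescape
    for _ in range(max_rounds):
        decoded = decode(text)
        if decoded == text:
            break
        text = decoded
    return text


print(unquote_string('Time-@#*%ed%20&amp;'))
print(unquote_string('http://example.com/a%2520b'))
```

Because the sample string has no scheme, only the HTML entity is unescaped and %ed%20 is left alone; the second call shows a doubly quoted URL being unquoted twice.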

Upvotes: 0

Answers (3)

7stud

Reputation: 48589

&amp; is an html entity for use in an html page--not in a url. So url unquoting won't work on it.

On the other hand, %ed and %20 are url escapes that are formatted for transporting as part of a url, so html unescaping won't work on them.

If you want to convert both html entities and url escapes, you need to process each sequence separately:

import urllib 
import HTMLParser
import re

html_parser = HTMLParser.HTMLParser()

data = 'Time-@#*%ed%20&amp;'

pattern = r"""
      %               #Match a '%' sign, followed by...
      [0-9a-f]{2}     #two hex digits..
    |               #OR
      &               #an ampersand, followed by... 
      .*?             #any character, 0 or more times, non-greedy, followed by...
      ;               #a semi-colon
"""

regex = re.compile(pattern, flags=re.X | re.I)

def replace_func(match_obj):
    match = match_obj.group(0)

    if match.startswith('%'):
        my_str = urllib.unquote(match)
        my_str = unicode(my_str, 'iso-8859-1').encode('utf-8')

    elif match.startswith('&'):
        unicode_str = html_parser.unescape(match)
        my_str = unicode_str.encode('utf-8')

    return my_str

result = re.sub(regex, replace_func, data)
print result

--output:--
Time-@#*í &

One problem: to convert an arbitrary byte like 0xed to a character, you have to know the encoding in which that byte is supposed to represent a character. I just guessed iso-8859-1, but you have to KNOW; otherwise you will not generally be able to convert strings like that.
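
To make the encoding problem concrete, here is a small sketch that decodes the same byte under a few candidate encodings. The helper name interpretations and the choice of encodings are only illustrative assumptions.

```python
def interpretations(raw, encodings=('iso-8859-1', 'cp1251', 'utf-8')):
    """Map each encoding name to the decoded text, or None if decoding fails."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            # The byte sequence is not valid in this encoding.
            results[enc] = None
    return results


print(interpretations(b'\xed'))
```

The byte 0xed is í in iso-8859-1, the Cyrillic letter н in cp1251, and not a complete character at all in UTF-8, so nothing in the data itself tells you which reading is right.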

Upvotes: 2

Serge Ballesta

Reputation: 148870

The problem is that %ed is a valid percent-encoded character, because ed is a valid hexadecimal value. If % is to be left untouched, it should be encoded as &#37; or %25. So your real problem is that your url string is not correctly encoded: if %ed is to be left untouched, the string should be:

url = 'Time-@#*&#37;ed%20&amp;'

As it is not correctly encoded (BTW, how did you get it?), you cannot expect standard tools to decode it correctly. How could unquote know that %20 must be processed but %ed must not?

At that point, the best you can do is to build a custom decoder.

url2 = url.replace('%20', ' ')
print html_parser.unescape(url2)

which gives:

Time-@#*%ed &
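
If replacing only %20 is too narrow, a more general custom decoder could look something like the sketch below. It rests on an assumption the question does not confirm: that "valid" means "decodes as UTF-8". Escapes whose bytes do not form valid UTF-8 are left exactly as written. The names unquote_valid_utf8 and _decode_run are made up for this example, and the Python 2 import fallback is a guess at the closest equivalent.

```python
import re

try:
    from urllib.parse import unquote_to_bytes        # Python 3
except ImportError:
    from urllib import unquote as unquote_to_bytes   # Python 2: str in, bytes out

# One or more consecutive %XX escapes.
_ESCAPE_RUN = re.compile(r'(?:%[0-9A-Fa-f]{2})+')


def _decode_run(match):
    text = match.group(0)
    raw = bytearray(unquote_to_bytes(text))
    out = []
    i = 0
    while i < len(raw):
        # A UTF-8 character is 1 to 4 bytes long; try each length in turn.
        for n in (1, 2, 3, 4):
            try:
                char = bytes(raw[i:i + n]).decode('utf-8')
            except UnicodeDecodeError:
                continue
            out.append(char)
            i += n
            break
        else:
            # No valid character starts here: keep the original %XX text.
            out.append(text[3 * i:3 * i + 3])
            i += 1
    return u''.join(out)


def unquote_valid_utf8(text):
    """Decode %XX escapes that form valid UTF-8; leave the others untouched."""
    return _ESCAPE_RUN.sub(_decode_run, text)


print(unquote_valid_utf8('Time-@#*%ed%20&amp;'))
```

On the sample string this keeps %ed, turns %20 into a space, and leaves &amp; for the HTML unescaping step. It is still a guess about intent: a stray escape that happens to be valid UTF-8 (for example %41) would be decoded.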

Upvotes: 3

ErikR

Reputation: 52029

The string returned by unquote() is a byte string, and here it appears to be latin1 encoded. Try this:

import urllib2
url = 'Time-@#*%ed%20&amp;'
x = urllib2.unquote(url)
u = x.decode('iso-8859-1')
print u

u will be a unicode string.

According to the Wikipedia page on percent-encoding, percent-encoding may also be used to encode UTF-8 data, so you may need to use x.decode('utf-8') instead. It all depends on where this data comes from and its context.
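
If you really cannot find out which encoding the data uses, one common workaround is to try UTF-8 first and fall back to iso-8859-1. The sketch below assumes those two candidates; unquote_with_fallback is a name made up for this example, and the Python 2 import fallback is an assumption.

```python
try:
    from urllib.parse import unquote_to_bytes        # Python 3
except ImportError:
    from urllib import unquote as unquote_to_bytes   # Python 2: str in, bytes out


def unquote_with_fallback(text, encodings=('utf-8', 'iso-8859-1')):
    """Percent-decode to bytes, then return the first decoding that succeeds."""
    raw = unquote_to_bytes(text)
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings fit: %r' % (encodings,))


print(unquote_with_fallback('Time-@#*%ed%20&amp;'))
print(unquote_with_fallback('%C3%AD'))
```

Note that iso-8859-1 accepts every possible byte, so it must come last and it never fails; a successful decode there only means "no error", not "correct".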

Upvotes: 1
