Ying

Reputation: 1990

issue with content in urllib2.urlopen

I have some simple Python code that makes a request to a server:

import urllib2  # Python 2 standard library

html_page = urllib2.urlopen(baseurl, timeout=20)
print html_page.read()
html_page.close()

I am trying to scrape a page that has a '-' (dash) character in it. It shows as a dash in the browser, but when I print the response from urlopen it comes out as '?'. I tried recreating the HTML page as a local file, copying the affected text over from the page source, but I could not reproduce the problem.

What other factors/variables might be in play? Could this have something to do with encoding?

UPDATE: I now know this problem is about encoding. The website is encoded in 'iso-8859-1'. The problem is that I still cannot decode it, even after following "Python: Converting from ISO-8859-1/latin1 to UTF-8".

The character, when decoded, gives me:

>>> text.decode("iso-8859-1")
  u"</strong><p>Let's\x97in "
>>> text.decode("iso-8859-1").encode("utf8")
  "</strong><p>Let's\xc2\x97in "
>>> print text.decode("iso-8859-1").encode("utf8")
  </strong><p>Let'sin

The character just completely disappears. Anyone have any ideas?

Upvotes: 0

Views: 1049

Answers (1)

Ying

Reputation: 1990

Thanks to Adam Rosenfield, I figured out my problem. The website declared its charset as iso-8859-1:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

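but the character I had an issue with was an em dash, encoded in Windows-1252 as byte 0x97. In ISO-8859-1 that byte is an invisible control character, which is why it disappeared when printed: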

>>> text.decode("windows-1252")
  u"</strong><p>Let's\u2014in "
>>> print text.decode("windows-1252")
  </strong><p>Let's—in
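
For anyone hitting this on Python 3, here is a minimal sketch of the same comparison. The `raw` value is a stand-in for the bytes returned by `urlopen(...).read()`, and `decode_like_browser` is a hypothetical helper of my own, not part of any library. It assumes, as browsers do, that a page declaring ISO-8859-1 should really be decoded as Windows-1252.

```python
# Python 3 sketch; `raw` stands in for the bytes read from the response.
raw = b"</strong><p>Let's\x97in "

# ISO-8859-1 maps byte 0x97 to U+0097, an invisible C1 control character.
as_latin1 = raw.decode("iso-8859-1")

# Windows-1252 maps the same byte to U+2014, the em dash.
as_cp1252 = raw.decode("windows-1252")


def decode_like_browser(data, declared_charset):
    """Hypothetical helper: treat a declared ISO-8859-1 as Windows-1252."""
    latin1_labels = {"iso-8859-1", "latin1", "latin-1"}
    charset = declared_charset.strip().lower()
    if charset in latin1_labels:
        charset = "windows-1252"
    return data.decode(charset)


if __name__ == "__main__":
    print(repr(as_latin1))
    print(repr(as_cp1252))
    print(decode_like_browser(raw, "ISO-8859-1"))
```

One caveat: Python's Windows-1252 codec leaves a few bytes (such as 0x81) undefined and raises UnicodeDecodeError on them, so passing `errors="replace"` to `decode` may be worth considering for messy pages.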

Thanks guys!

Upvotes: 1
