kannerke

Reputation: 1

urllib2 does not read entire page

A portion of my code that parses a web site has stopped working.

I can trace the problem to the .read() method of the object returned by urllib2.urlopen:

page = urllib2.urlopen('http://magiccards.info/us/en.html')
data = page.read()

Until yesterday this worked fine, but now the length of the data is always 69496 instead of 122989. When I open smaller pages, however, my code works fine.

I have tested this on Ubuntu, Linux Mint, and Windows 7; all show the same behaviour.

I'm assuming that something has changed on the web server, but the page is complete when I open it in a web browser. I have also tried to diagnose the issue with Wireshark, and there too the page is received complete.
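
One more check I can think of is printing the response headers next to the length I actually get, something like this (same URL as above; these are just standard header lookups, nothing specific to this site):

import urllib2

page = urllib2.urlopen('http://magiccards.info/us/en.html')
info = page.info()
# Compare what the server claims to send with what .read() returns.
print 'Content-Length:', info.getheader('Content-Length')
print 'Content-Encoding:', info.getheader('Content-Encoding')
print 'Connection:', info.getheader('Connection')
data = page.read()
print 'actually read:', len(data)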

Does anybody know why this may be happening or what I could try to determine the issue?

Upvotes: 0

Views: 3587

Answers (2)

Senthil Kumaran

Reputation: 56951

Yes, the server is closing the connection, and you need keep-alive to be sent. urllib2 does not have that facility ( :-( ). There used to be urlgrabber, which provided an HTTPHandler that works alongside the urllib2 opener, but unfortunately I don't find that working either. At the moment, you could use other libraries, like requests (as demonstrated in the other answer) or httplib2.

import httplib2
# httplib2 keeps connections alive between requests; ".cache" is just
# a directory for its on-disk response cache.
h = httplib2.Http(".cache")
resp, content = h.request("http://magiccards.info/us/en.html", "GET")
print len(content)
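
If you want to see what is happening one level down, here is a minimal sketch at the httplib level that sends the keep-alive header by hand (illustrative only; managing connection reuse yourself is the part urllib2 gives you no help with):

import httplib

conn = httplib.HTTPConnection('magiccards.info')
# Ask the server to keep the connection open across the response.
conn.request('GET', '/us/en.html', headers={'Connection': 'keep-alive'})
resp = conn.getresponse()
print len(resp.read())
conn.close()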

Upvotes: 0

Solon

Reputation: 41

The page seems to be misbehaving unless you request the content encoded as gzip. Give this a shot:

import urllib2
import zlib

request = urllib2.Request('http://magiccards.info/us/en.html')
# Ask the server for a gzip-compressed response.
request.add_header('Accept-Encoding', 'gzip')
response = urllib2.urlopen(request)
# 16 + zlib.MAX_WBITS tells zlib to expect a gzip header and trailer.
data = zlib.decompress(response.read(), 16 + zlib.MAX_WBITS)

As Nathan suggested, you could also use the great Requests library, which accepts gzip by default.

import requests

data = requests.get('http://magiccards.info/us/en.html').text
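
(Requests asks for gzip by default and transparently decompresses the response, so .text here is the full decoded page, with no zlib handling on your side.)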

Upvotes: 4
