Reputation: 7233
I've scraped many websites and have often wondered why the response headers displayed in Firebug and the response headers returned by urllib.urlopen(url).info()
are often different in that Firebug reports MORE headers.
I encountered an interesting one today. I'm scraping a website by following a "search url" that fully loads (returns a 200 status code) before redirecting to a final page. The easiest way to perform the scrape would be to return the Location
response header and make another request. However, that particular header is absent when I run 'urllib.urlopen(url).info().
Here is the difference:
Firebug headers:
Cache-Control : no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Connection : keep-alive
Content-Encoding : gzip
Content-Length : 2433
Content-Type : text/html
Date : Fri, 05 Oct 2012 15:59:31 GMT
Expires : Thu, 19 Nov 1981 08:52:00 GMT
Location : /catalog/display/1292/index.html
Pragma : no-cache
Server : Apache/2.0.55
Set-Cookie : PHPSESSID=9b99dd9a4afb0ef0ca267b853265b540; path=/
Vary : Accept-Encoding,User-Agent
X-Powered-By : PHP/4.4.0
Headers returned by my code:
Date: Fri, 05 Oct 2012 17:16:23 GMT
Server: Apache/2.0.55
X-Powered-By: PHP/4.4.0
Set-Cookie: PHPSESSID=39ccc547fc407daab21d3c83451d9a04; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Vary: Accept-Encoding,User-Agent
Content-Type: text/html
Connection: close
Here's my code:
from BeautifulSoup import BeautifulSoup
import urllib
import psycopg2
import psycopg2.extras
import scrape_tools
tools = scrape_tools.tool_box()
db = tools.db_connect()
cursor = db.cursor(cursor_factory = psycopg2.extras.RealDictCursor)
cursor.execute("SELECT data FROM table WHERE variable = 'Constant' ORDER BY data")
for row in cursor:
url = 'http://www.website.com/search/' + row['data']
headers = {
'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Encoding' : 'gzip, deflate',
'Accept-Language' : 'en-us,en;q=0.5',
'Connection' : 'keep-alive',
'Host' : 'www.website.com',
'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0.1'
}
post_params = {
'query' : row['data'],
'searchtype' : 'products'
}
post_args = urllib.urlencode(post_params)
soup = tools.request(url, post_args, headers)
print tools.get_headers(url, post_args, headers)
Please note: scrape_tools
is a module I wrote myself. The code contained in the module to retrieve headers is (basically) as follows:
class tool_box:
def get_headers(self, url, post, headers):
file_pointer = urllib.urlopen(url, post, headers)
return file_pointer.info()
Is there a reason for the discrepancy? Am I making a silly mistake in my code? How can I retrieve the missing header data? I'm fairly new to Python, so please forgive any dumb errors.
Thanks in advance. Any advice is much appreciated!
Also...Sorry about the wall of code =\
Upvotes: 3
Views: 954
Reputation: 32560
You're not getting the same kind of response for the two requests. For example, the response to the Firefox request contains a Location:
header, so it's probably a 302 Moved temporarily
or a 301
. Those don't contain any actual body data, but instead cause your Firefox to issue a second request to the URL in the Location:
header (urllib doesn't do that).
The Firefox response also uses Connection : keep-alive
while the urllib request got answered with Connection: close
.
Also, the Firefox response is gzipped (Content-Encoding : gzip
), while the urllib one is not. That's probably because your Firefox sends a Accept-Encoding: gzip, deflate
header with its request.
Don't rely on Firebug to tell you HTTP headers (even though it does so truthfully most of the time), but use a sniffer like wireshark to inspect what's actually going over the wire.
You're obviously dealing with two different responses.
There could be several reasons for this. For one, web servers are supposed to respond differently depending on what Accept-Language
, Accept-Encoding
headers etc.. the client sends in its request. Then there's also the possibility that the server does some kind of User-Agent
sniffing.
Either way, capture your requests with urllib
as well as the ones with Firefox using wireshark and first compare the requests (not the headers, but the actual GET / HTTP/1.0
part. Are they really the same? If yes, move on to comparing request headers and start manually setting the same headers for the urllib
request until you figure out which headers make a difference.
Upvotes: 3