python urllib.request not getting same html as my browser

Question

I'm trying to get the HTML of http://groupon.cl/descuentos/santiago-centro with the following Python code:

import urllib.request
url="http://groupon.cl/descuentos/santiago-centro"
request = urllib.request.Request(url, headers = {'user-agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'})
response = urllib.request.urlopen(request)
html = response.read().decode('utf-8')

I'm getting the HTML of a page that asks for my location. If I open the same link manually in my browser (with no cookies involved, even in a freshly installed browser), I go directly to a page with discount promotions. There seems to be a redirect that urllib does not handle the way the browser does. I am sending a user-agent header to imitate a typical browser, but with no luck.

How can I get the same HTML as my browser does?

pexeer · Accepted Answer

I think you can run this command:

wget -d http://groupon.cl/descuentos/santiago-centro

You will see wget print two HTTP exchanges and save the final response page to a file. The two responses are:

 -   HTTP/1.1 302 Moved Temporarily
 -   HTTP/1.1 200 OK

The content of that file is the HTML you want.
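If you would rather check the redirect from Python than with wget, here is a minimal sketch. It assumes the site still answers with a 302 plus a Set-Cookie header as described above. The names `NoRedirect` and `first_response` are made up for this example; `redirect_request` is the standard hook in `urllib.request.HTTPRedirectHandler`.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Redirect handler that refuses to follow any redirect."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None tells urllib not to follow; it raises HTTPError instead.
        return None


def first_response(url):
    """Return (status, Location, Set-Cookie values) of the first response only."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url) as response:
            status, headers = response.status, response.headers
    except urllib.error.HTTPError as err:
        # With redirects disabled, a 301/302 surfaces here as an HTTPError.
        status, headers = err.code, err.headers
    return status, headers.get("Location"), headers.get_all("Set-Cookie") or []
```

Calling first_response("http://groupon.cl/descuentos/santiago-centro") should then report the redirect status, its target, and whatever cookies the server tries to set, provided the site still behaves this way.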

The first response code is 302, so urllib.request.urlopen makes a second request. But it does not send back the cookie it received in the first response, so the server cannot understand the second request and you get a different page.
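One way to make urllib keep that cookie across the redirect is to build an opener with a cookie jar. This is only a sketch under the assumption that the missing cookie is the whole problem; I have not verified it against the site. The helper names `build_cookie_opener` and `fetch_html` are made up for this example, and the user-agent string is simply the one from the question.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener():
    """Return (opener, jar); the opener stores cookies and replays them on redirects."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # Same user-agent as in the question; any browser-like value would be a guess.
    opener.addheaders = [("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)")]
    return opener, jar


def fetch_html(url, opener):
    """Fetch url through the opener and decode the body as UTF-8."""
    with opener.open(url) as response:
        return response.read().decode("utf-8")
```

You would use it as opener, jar = build_cookie_opener() followed by html = fetch_html("http://groupon.cl/descuentos/santiago-centro", opener). The cookie from the 302 response ends up in jar and is sent with the follow-up request automatically.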

The http.client module does not follow 301 or 302 HTTP responses by itself, so with it you can handle the redirect and the cookie manually:

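import http.client

conn = http.client.HTTPConnection("groupon.cl")
# First request: the server answers with a redirect and a cookie
conn.request("GET", "/descuentos/santiago-centro")
response = conn.getresponse()
print(response.status)        # 301 or 302
print(response.getheaders())  # includes Set-Cookie
response.read()               # drain the body before reusing the connection

# Get the cookie: keep only the name=value part of the first Set-Cookie
cookie = response.getheader("Set-Cookie", "").split(";")[0]
headers = {"Cookie": cookie}

# Second request: send the cookie back to the server
conn.request("GET", "/", headers=headers)
response = conn.getresponse()
# Get the response page
html = response.read().decode("utf-8")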
