Reputation: 130
I'm trying to get the HTML of http://groupon.cl/descuentos/santiago-centro with the following Python code:
import urllib.request
url="http://groupon.cl/descuentos/santiago-centro"
request = urllib.request.Request(url, headers = {'user-agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'})
response = urllib.request.urlopen(request)
html = response.read().decode('utf-8')
I'm getting the HTML of a page that asks for my location. If I open the same link manually in my browser (with no cookies involved, even in a freshly installed browser), I go directly to a page with discount promotions. It seems there is some redirect that is not taking place with urllib. I am setting the user-agent header to try to get the behaviour of a typical browser, but with no luck.
How can I get the same HTML that my browser gets?
Upvotes: 2
Views: 1699
Reputation: 705
I think you can run this command:
wget -d http://groupon.cl/descuentos/santiago-centro
You will see wget make two HTTP requests and save the final response page to a file. The two responses are:
- HTTP/1.1 302 Moved Temporarily
- HTTP/1.1 200 OK
The content of that file is the HTML you want.
The first response code is 302, so urllib.request.urlopen makes a second request. However, it does not send the cookie set by the first response, so the server cannot understand the second request and you get a different page.
The http.client module does not handle 301 or 302 HTTP responses by itself, so with it you can make both requests yourself and pass the cookie along:
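One way to make urllib keep the cookie from the 302 response is to open the URL through an opener that has a cookie jar attached. This is a minimal sketch, assuming the cookie is the only thing the server is missing; `build_cookie_opener` and `fetch_html` are hypothetical helper names, not part of the question's code.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener():
    """Return (opener, jar); the opener stores and resends cookies."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Same user-agent string as in the question; any value may do.
    opener.addheaders = [
        ("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)")
    ]
    return opener, jar


def fetch_html(url, opener):
    """Fetch url, following redirects, and decode the body as UTF-8."""
    with opener.open(url) as response:
        return response.read().decode("utf-8")


# Example call (performs a real network request, so it is left commented):
# opener, jar = build_cookie_opener()
# html = fetch_html("http://groupon.cl/descuentos/santiago-centro", opener)
```

The redirect is still followed automatically; the difference is that the Set-Cookie header from the first response is stored in the jar and sent back as a Cookie header on the second request.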
import http.client
conn = http.client.HTTPConnection("groupon.cl")
# do first request
conn.request("GET", "/descuentos/santiago-centro")
response = conn.getresponse()
print(response.status) # 301 or 302
print(response.getheaders()) # includes Set-Cookie
response.read() # finish reading before sending the next request
# get the cookie
headers = ....
# do second request
conn.request("GET", "/", headers=headers)
......
......
# Get response page.
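The outline above could be filled in roughly as follows. This is only a sketch under several assumptions: the redirect stays on the same host, there is a single redirect hop, and sending back the name=value part of each Set-Cookie header is enough for the server. The function name `fetch_with_manual_redirect` is hypothetical.

```python
import http.client
from urllib.parse import urlsplit


def fetch_with_manual_redirect(host, path, port=80):
    """GET path; on 301/302, repeat the GET at Location with the cookies."""
    conn = http.client.HTTPConnection(host, port)
    try:
        # First request: expect a redirect that also sets a cookie.
        conn.request("GET", path)
        first = conn.getresponse()
        body = first.read()  # drain the body before the next request
        if first.status not in (301, 302):
            return first.status, body.decode("utf-8")

        # Keep only the "name=value" part of every Set-Cookie header.
        set_cookies = first.headers.get_all("Set-Cookie") or []
        cookies = [value.split(";", 1)[0] for value in set_cookies]

        # Assumption: Location points at the same host, so use its path only.
        target = urlsplit(first.getheader("Location", "/"))
        next_path = target.path or "/"
        if target.query:
            next_path += "?" + target.query

        # Second request: send the cookies back to the server.
        headers = {"Cookie": "; ".join(cookies)} if cookies else {}
        conn.request("GET", next_path, headers=headers)
        second = conn.getresponse()
        return second.status, second.read().decode("utf-8")
    finally:
        conn.close()


# Example call (performs a real network request, so it is left commented):
# status, html = fetch_with_manual_redirect(
#     "groupon.cl", "/descuentos/santiago-centro")
```

A real site may redirect more than once or to another host, in which case a loop over the hops or a cookie-aware opener is probably the more practical choice.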
Upvotes: 1