user562427

Reputation: 79

Downloading an amazon.co.uk webpage using only Python, with the HTML exactly as Firebug sees it

I noticed that when I use urllib to download this webpage:

http://www.amazon.co.uk/Darkness-II-Limited-PC-DVD/dp/B005ULLEX6

the content that I get back from urlopen(url).read() is different from what Firebug sees.

Example:

If you point Firebug at the page's image area, it tells you that a div with id="prodImageCell" exists. However, that div is nowhere in what Python has downloaded, so BeautifulSoup doesn't find anything.

Is this because the images are generated using JavaScript?

Question:

If so, is there a way of downloading pretty much exactly what Firebug sees using urllib (and without using something like Selenium)?

I am trying to fetch the source URL of one of the images programmatically. For example, the image in the div with id="prodImageCell" has src=http://ecx.images-amazon.com/images/I/51uPDvJGS3L.AA300.jpg, which is indeed the URL of the image.

Answer:

(I can't post this as a proper answer because I don't have enough reputation :( )

I found the solution thanks to @huelbois, who pointed me in the right direction: the request needs to send a User-Agent header.

Before

>>> import urllib2
>>> import re
>>> site = urllib2.urlopen('http://www.amazon.co.uk/Darkness-II-Limited-PC-DVD/dp/B005ULLEX6').read()
>>> re.search( 'prodImageCell', site )
>>>

After

>>> url = 'http://www.amazon.co.uk/Darkness-II-Limited-PC-DVD/dp/B005ULLEX6'
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1"
>>> headers = {'User-Agent': user_agent}
>>> req = urllib2.Request(url=url, headers=headers)
>>> site = urllib2.urlopen(req).read()
>>> re.search('prodImageCell', site)
<_sre.SRE_Match object at 0x01487DB0>

hurrah!
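For anyone on Python 3, where urllib2 has been folded into urllib.request, the same idea could look roughly like the sketch below. The helper names build_request and fetch are made up for illustration, and the snippet only builds and inspects the request; whether the site still responds differently to different agents is not something it checks. It also prints the default agent string that urllib sends when no header is set, which is probably why the first attempt got a different page.

```python
import urllib.request

# Browser-like User-Agent string taken from the session above.
USER_AGENT = "Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that announces itself with a browser User-Agent.

    Hypothetical helper, not part of urllib.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=USER_AGENT):
    """Download url and return the body as text (this performs a network call).

    Hypothetical helper, not part of urllib.
    """
    with urllib.request.urlopen(build_request(url, user_agent)) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Without an explicit header, urllib identifies itself as "Python-urllib/<version>".
default_agent = dict(urllib.request.build_opener().addheaders)["User-agent"]
print(default_agent)

# Request stores header names capitalized, hence "User-agent" in the lookup.
req = build_request("http://www.amazon.co.uk/Darkness-II-Limited-PC-DVD/dp/B005ULLEX6")
print(req.get_header("User-agent"))
```

Calling fetch(url) on the product page and then searching the result for 'prodImageCell' would be the Python 3 counterpart of the "After" session.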

Upvotes: 1

Views: 503

Answers (1)

huelbois

Reputation: 7012

I just tested it with wget (which should behave like urllib here). You HAVE to include a User-Agent header to get the requested part:

wget -O- --header='User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:9.0.1) Gecko/20100101 Firefox/9.0.1' http://www.amazon.co.uk/Darkness-II-Limited-PC-DVD/dp/B005ULLEX6

This returns the HTML page, including the requested part.

Oops: I just saw that you already succeeded with my previous advice. Great!
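Once the HTML contains the prodImageCell div, the image URL the question asks about could be pulled out along these lines. This is only a sketch using the standard library's html.parser instead of BeautifulSoup; the class ProdImageParser, the function find_product_image and the SAMPLE_HTML fragment are all invented for illustration, and the real page markup may well be shaped differently.

```python
from html.parser import HTMLParser


class ProdImageParser(HTMLParser):
    """Collect the src of the first <img> inside <div id="prodImageCell">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0      # > 0 while inside the target div
        self.image_src = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth > 0:
                # Nested div inside the target: track it so we know when we leave.
                self.div_depth += 1
            elif attrs.get("id") == "prodImageCell":
                self.div_depth = 1
        elif tag == "img" and self.div_depth > 0 and self.image_src is None:
            self.image_src = attrs.get("src")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1


def find_product_image(html):
    """Return the image URL, or None if the div or image is missing."""
    parser = ProdImageParser()
    parser.feed(html)
    return parser.image_src


# Hypothetical fragment shaped like the markup described in the question.
SAMPLE_HTML = """
<div id="other"><img src="http://example.com/banner.jpg"></div>
<div id="prodImageCell">
  <div><img src="http://ecx.images-amazon.com/images/I/51uPDvJGS3L.AA300.jpg"></div>
</div>
"""

print(find_product_image(SAMPLE_HTML))
```

Getting None back from find_product_image is a quick sign that the server sent the stripped-down page again, for instance because the User-Agent header was left out.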

Upvotes: 2
