Reputation: 531
I've been trying to use Scrapy to retrieve the link of the first image for a query in Google Images.
For example, I've been trying to get the first link for this specific query: Emma Watson - Google Images. To formulate the XPath I've been using XPath Helper, which is an extension for Google Chrome. The first XPath I tried was the following:
.//*[@id='rg_s']/div[1]/a/@href
It returned the following in XPath Helper:
http://www.google.com/imgres?imgurl=http://images.enstarz.com/data/images/full/15421/emma-watson.jpg&imgrefurl=http://www.styliwallpapers.com/celebrities/emma-watson/&h=2332&w=3500&tbnid=DPhW7CJ1erAD0M:&zoom=1&docid=22MKylYlja8LIM&ei=9oOUVbzdHsShgwTXqYOYBw&tbm=isch&ved=0CBsQMygAMAA
which is actually what I want. I would then scrape that URL too and get the URL of the image itself. But when I tried the same XPath in the Scrapy shell, it returned []. Empty!
I tried another XPath (pointing to the same place):
.//div[@class='rg_di rg_el ivg-i'][1]/a[@class='rg_l']/@href
and still got the same answer: [].
I don't know what I'm doing wrong. Can you help me with this?
PS. What I use in Scrapy Shell is the following:
>response.xpath(".//*[@id='rg_s']/div[1]/a/@href")
# returned: []
>response.xpath(".//div[@class='rg_di rg_el ivg-i'][1]/a[@class='rg_l']/@href")
# returned: []
Something to add: when I tried to get the title of the page, it worked.
>response.xpath(".//title/text()").extract()
# returns: [u'emma watson - Google Search']
Upvotes: 1
Views: 1739
Reputation: 4307
According to my results using scrapy view, Google Images does in fact load the first 20 images by default without using JavaScript. Try this XPath instead:
//table[@class="images_table"]//img/parent::a/@href
If you need to access a specific image, wrap the img expression in parentheses and use an index:
(//table[@class="images_table"]//img)[1]/parent::a/@href
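The reason for the parentheses can be sketched with Python's standard library alone. This is only an illustration under assumptions: the SAMPLE markup is a made-up, heavily simplified stand-in for the static results table, and ElementTree supports just a subset of XPath, so the loop and the list index below stand in for parent::a/@href and for the parenthesized (...)[1] form.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the static results markup.
# The real page is much larger and its structure may differ.
SAMPLE = """
<html><body>
<table class="images_table">
  <tr>
    <td><a href="/url?q=first"><img src="a.jpg"/></a></td>
    <td><a href="/url?q=second"><img src="b.jpg"/></a></td>
  </tr>
</table>
</body></html>
""".strip()


def image_links(markup):
    """Collect the href of every <a> that directly wraps an <img>
    inside a table with class "images_table"."""
    root = ET.fromstring(markup)
    links = []
    for table in root.iter("table"):
        if table.get("class") != "images_table":
            continue
        for anchor in table.iter("a"):
            if anchor.find("img") is not None:
                links.append(anchor.get("href"))
    return links


root = ET.fromstring(SAMPLE)

# A positional predicate without parentheses is evaluated per parent:
# each <img> is the first <img> inside its own <a>, so both match.
per_parent = root.findall(".//img[1]")

# Indexing the whole result list picks a single document-wide match,
# which is what the parenthesized (...)[1] form does in full XPath.
links = image_links(SAMPLE)
first_link = links[0] if links else None

print(len(per_parent))
print(first_link)
```

In the Scrapy shell itself the XPath from this answer can be passed to response.xpath() directly; the sketch only shows why the index has to apply to the whole node set rather than to each parent.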
Upvotes: 2
Reputation: 3691
Did you look at the response body in Scrapy?
Many modern websites do not deliver all of their content in the initial response, because that response could be slow depending on network and server load, and users would give up and go to other pages. Instead they load resources asynchronously (AJAX and XHR are useful keywords to search for). Google Images does the same: when you open the page in a browser there is a lot of network traffic going on, including two XHR responses.
If you look at the response body in Scrapy, you will find that there is no element with the id 'rg_s' that you are looking for, nor one with the class 'rg_di rg_el ivg-i'.
If you open, copy, or download the XHR responses, one of them contains the URL you found with XPath Helper.
This means that the page Scrapy crawls has dynamic features that are not executed during crawling, so the HTML you download is different from what is displayed in your browser.
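One way to confirm this before writing any XPath is to list the ids and classes that actually occur in the downloaded body. The following is a minimal sketch using only the standard library; STATIC_BODY is a made-up stand-in for what Scrapy receives, and collect_ids_and_classes is a hypothetical helper name, not part of Scrapy.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Record every id and every class token seen in start tags."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue
            if name == "id":
                self.ids.add(value)
            elif name == "class":
                # A class attribute may hold several space-separated tokens.
                self.classes.update(value.split())


def collect_ids_and_classes(html):
    """Hypothetical helper: return (ids, classes) found in an HTML string."""
    collector = AttrCollector()
    collector.feed(html)
    collector.close()
    return collector.ids, collector.classes


# Made-up stand-in for the body Scrapy downloads (no JavaScript executed).
STATIC_BODY = """<html><head><title>emma watson - Google Search</title></head>
<body><table class="images_table"><tr>
<td><a href="/url?q=x"><img src="t.jpg"></a></td>
</tr></table></body></html>"""

ids, classes = collect_ids_and_classes(STATIC_BODY)
print("rg_s" in ids)
print("rg_di" in classes)
print("images_table" in classes)
```

In a recent Scrapy shell the same helper could be fed response.text instead of STATIC_BODY. If the id or class used in an XPath is missing from the collected sets, the selector cannot match no matter how it is written.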
Upvotes: 1