Reputation: 69
Here is the code I used for crawling a web page. The site I want to scrape has lazy loading enabled for its images, so Scrapy can only grab 10 out of 100 images; the rest are all placeholder.jpg. What would be the best way to deal with lazy-loaded images in Scrapy?
Thanks!
import scrapy


class MasseffectSpider(scrapy.Spider):
    name = "massEffect"
    allowed_domains = ["amazon.com"]
    start_urls = [
        'file://127.0.0.1/home/ec2-user/scrapy/amazon/amazon.html',
    ]

    def parse(self, response):
        # iterate over each product container on the page
        for item in response.css('div.item'):
            listing = {}
            listing['image'] = item.css('div.product img::attr(src)').extract()
            listing['url'] = item.css('div.item-name a::attr(href)').extract()
            yield listing
It seems other tools like CasperJS have a viewport that triggers the images to load.
casper.start('http://m.facebook.com', function() {
    // The pretty HUGE viewport allows for roughly 1200 images.
    // If you need more you can either resize the viewport or scroll
    // down the viewport to load more DOM (probably the best approach).
    this.viewport(2048, 4096);
    this.fill('form#login_form', {
        'email': login_username,
        'pass': login_password
    }, true);
});
Upvotes: 3
Views: 4619
Reputation: 231
To scrape lazy-loaded images, you have to track the AJAX request that returns the images, then hit that request from Scrapy. After extracting all the data from a given page, send the extracted data on to another callback via `meta` in the Scrapy request. For further help, see the documentation for Scrapy's Request.
Upvotes: 1
Reputation: 5240
The problem is that the lazy loading is done by JavaScript, which Scrapy can't execute; CasperJS handles this.
To make this work with Scrapy you have to combine it with Selenium or scrapyjs.
Upvotes: 4