Reputation: 225
I'm trying to scrape all of a site's entries and available content in order to learn Scrapy. So far, I've been able to scrape all of the blog entries on a page and then go to the next page and scrape the content there. I've also found the next page's link. However, I can't figure out how to proceed from there, even though I've read quite a few tutorials and looked at example code. Here is what I have so far:
class SaltandLavender(CrawlSpider):
    logging.getLogger('scrapy').propagate = False
    name = 'saltandlavender'
    rules = (
        Rule(LinkExtractor(allow=''), callback="parse", follow=True),
    )

    def parse(self, response):
        #with open('page.html', 'wb') as html_file:
        #    html_file.write(response.body)
        print "start 1"
        for href in response.css('.entry-title a'):
            print "middle 1"
            yield response.follow(href, callback=self.process_page)
        next = response.css('li.pagination-next a::text')
        if next:
            url = ''.join(response.css('li.pagination-next a::attr(href)').extract())
            print url

    def process_page(self, response):
        print "start 2"
        post_images = response.css('div.entry-content img::attr(src)').extract()
        content = {
            'title': ''.join(response.css('article.format-standard h1.entry-title::text').extract()),
            #'content': response.xpath(".//div[@class='entry-content']/descendant::text()").extract(),
            'ingredients': ''.join(response.css('div.wprm-recipe-ingredients-container div.wprm-recipe-ingredient-group').extract()),
        }
        #print content
        print "end 2"

    def errorCatch(self):
        print "Script encountered an error. Check selectors for changes in the site's layout and design..."

    def updateValid(self):

if __name__ == "__main__":
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
Upvotes: 0
Views: 73
Reputation: 3857
You need to yield the request, not just create an instance of it.
yield Request(url)
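To see why the yield matters, here is a rough stdlib-only sketch (Python 3). It is not Scrapy's engine, just a toy model of the same idea: the crawler only schedules what a callback hands back through yield, so a URL that is merely printed is never visited. The names SITE, crawl, parse_print_only and parse_yield_next are made up for this illustration.

```python
from collections import deque

# Hypothetical three-page "site": each page maps to its next page (or None).
SITE = {"/page/1": "/page/2", "/page/2": "/page/3", "/page/3": None}


def parse_print_only(url):
    """Mimics the question: entries are yielded, the next URL is only printed."""
    yield {"item": url}
    next_url = SITE[url]
    if next_url:
        print("next page:", next_url)  # printed, but the engine never sees it


def parse_yield_next(url):
    """Same callback, but the next URL is yielded so it gets scheduled."""
    yield {"item": url}
    next_url = SITE[url]
    if next_url:
        yield next_url


def crawl(start_url, callback):
    """Toy engine: yielded strings are treated as requests, dicts as items."""
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for result in callback(url):
            if isinstance(result, str):
                queue.append(result)  # schedule the follow-up page
    return visited


print(crawl("/page/1", parse_print_only))
print(crawl("/page/1", parse_yield_next))
```

With the print-only callback the toy crawl stops after the first page; with the yielding callback it walks all three pages.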
Upvotes: 0
Reputation: 3717
There is something wrong with how you request the next page. For example, you name a variable next, which shadows Python's built-in next() function, and you never yield the request for the next page. Try this fix:
def parse(self, response):
    for href in response.css('.entry-title a'):
        yield response.follow(href, callback=self.process_page)
    next_page = response.css('li.pagination-next a::attr(href)').get()
    if next_page:
        yield response.follow(next_page)
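Two details in the fix are worth spelling out. .get() returns None when the selector matches nothing, which is why the if check ends the crawl on the last page. response.follow also accepts a relative href and resolves it against the current page's URL, whereas a plain Request needs an absolute URL. The sketch below (Python 3, stdlib only) imitates that resolution step with urljoin. The helper name next_page_url and the example.com URLs are assumptions made for illustration and are not part of Scrapy.

```python
from urllib.parse import urljoin


def next_page_url(page_url, href):
    """Return the absolute URL of the next page, or None if no link was found."""
    if not href:
        # Mirrors the `if next_page:` check: no pagination link, nothing to follow.
        return None
    # Relative hrefs are resolved against the current page; absolute ones pass through.
    return urljoin(page_url, href)


current = "https://example.com/blog/page/2/"  # hypothetical current page

print(next_page_url(current, "/blog/page/3/"))                     # relative href
print(next_page_url(current, "https://example.com/blog/page/3/"))  # absolute href
print(next_page_url(current, None))                                # last page
```

The first two calls both produce https://example.com/blog/page/3/, and the last one produces None, so no further request would be made.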
Upvotes: 1