Reputation: 1885
I am currently trying to make a simple crawler in Python using Scrapy. What I want it to do is read a list of links and save the HTML of the pages they link to. Right now I am able to get all of the URLs, but I can't figure out how to download the pages. Here is the code for my spider so far:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from tutorial.items import BookItem

# Book Scrapy spider
class DmozSpider(BaseSpider):
    name = "book"
    allowed_domains = ["learnpythonthehardway.org"]
    start_urls = [
        "http://www.learnpythonthehardway.org/book/",
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        file = open(filename, 'wb')
        file.write(response.body)
        file.close()

        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = BookItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            items.append(item)
        return items
Upvotes: 1
Views: 1627
Reputation: 179687
In your parse method, return Request objects in the list of returned items to trigger downloads:
for site in sites:
    ...
    items.append(item)
    items.append(Request(item['link'][0], callback=self.parse))
This will cause the crawler to produce a BookItem for each link, but also recurse and download each book's page. Of course, you can specify a different callback (e.g. self.parsebook) if you want to parse the subpages differently.
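For reference, here is a minimal sketch of how the two callbacks could fit together, assuming the same BaseSpider/HtmlXPathSelector API used in the question; parse_book is a hypothetical callback name, Request comes from scrapy.http, and urljoin is used on the assumption that the extracted hrefs are relative to the start URL:

from urlparse import urljoin  # Python 2, matching the question's Scrapy version

from scrapy.http import Request
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from tutorial.items import BookItem

class DmozSpider(BaseSpider):
    name = "book"
    allowed_domains = ["learnpythonthehardway.org"]
    start_urls = ["http://www.learnpythonthehardway.org/book/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for site in hxs.select('//ul/li'):
            item = BookItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            yield item
            if item['link']:
                # the hrefs are assumed relative, so build an absolute URL first
                url = urljoin(response.url, item['link'][0])
                yield Request(url, callback=self.parse_book)

    def parse_book(self, response):
        # hypothetical callback: save the raw HTML of each linked page
        filename = response.url.split("/")[-1] or 'index.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

Yielding both items and Requests from parse is equivalent to appending them all to a returned list; Scrapy follows each Request and passes the downloaded response to the given callback.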
Upvotes: 1