Pratik Poddar

Reputation: 1345

Scrapy Crawler - How do I specify which links to crawl

I am using scrapy to crawl my website http://www.cseblog.com

My spider is as follows:

from scrapy.spider import BaseSpider
from bs4 import BeautifulSoup ## This is BeautifulSoup4
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from blogscraper.items import BlogArticle ## This is for saving data. Probably insignificant.

class BlogArticleSpider(BaseSpider):
    name = "blogscraper"
    allowed_domains = ["cseblog.com"]
    start_urls = [
        "http://www.cseblog.com/",
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=('\d+/\d+/*"', ), deny=( ))),
    )

    def parse(self, response):
        site = BeautifulSoup(response.body_as_unicode())
        items = []
        item = BlogArticle()
        item['title'] = site.find("h3" , {"class": "post-title" } ).text.strip()
        item['link'] = site.find("h3" , {"class": "post-title" } ).a.attrs['href']
        item['text'] = site.find("div" , {"class": "post-body" } )
        items.append(item)
        return items

Where do I specify that it needs to crawl all links of the form http://www.cseblog.com/{d+}/{d+}/{*}.html and http://www.cseblog.com/search/{*} recursively, but save data only from http://www.cseblog.com/{d+}/{d+}/{*}.html?

Upvotes: 3

Views: 413

Answers (1)

Biswanath

Reputation: 9185

You have to create either two rules, or a single rule telling Scrapy to allow URLs of both types. Basically, your rules list should look something like this:

rules = (
    # Article pages: save their data via the callback, and keep following links
    Rule(SgmlLinkExtractor(allow=(r'\d+/\d+/.*\.html',)), callback='parse_save', follow=True),
    # Search pages: crawl them to discover more links
    Rule(SgmlLinkExtractor(allow=(r'search/',)), callback='parse_only', follow=True),
)

BTW, you should be using CrawlSpider, and rename the parse method, unless you want to override that method from the base class.

Since the two link types have different callbacks, you can decide which pages' data to save, rather than using a single callback and checking response.url inside it again.
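
Putting it together, a minimal sketch of the reworked spider might look like the following. The callback name parse_save is hypothetical; BlogArticle and the extraction logic come from the question. If nothing needs to be saved from the search pages, that rule can simply follow links with no callback at all, as done here:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from bs4 import BeautifulSoup

from blogscraper.items import BlogArticle

class BlogArticleSpider(CrawlSpider):
    name = "blogscraper"
    allowed_domains = ["cseblog.com"]
    start_urls = ["http://www.cseblog.com/"]

    rules = (
        # Article pages: extract and save data, and keep following links
        Rule(SgmlLinkExtractor(allow=(r'\d+/\d+/.*\.html',)),
             callback='parse_save', follow=True),
        # Search pages: crawled for links only; with no callback,
        # CrawlSpider follows their links by default
        Rule(SgmlLinkExtractor(allow=(r'search/',))),
    )

    def parse_save(self, response):
        # Same extraction as in the question, moved out of parse()
        # so that CrawlSpider's own parse() is not overridden
        site = BeautifulSoup(response.body_as_unicode())
        item = BlogArticle()
        item['title'] = site.find("h3", {"class": "post-title"}).text.strip()
        item['link'] = site.find("h3", {"class": "post-title"}).a.attrs['href']
        item['text'] = site.find("div", {"class": "post-body"})
        return [item]

Note that follow=True matters on the article rule: when a Rule has a callback, CrawlSpider stops following links from those pages unless follow is set explicitly.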

Upvotes: 1
