Pratik Poddar

Reputation: 1345

Scrapy Crawler - How do I specify which links to crawl

I am using scrapy to crawl my website http://www.cseblog.com

My spider is as follows:

from scrapy.spider import BaseSpider
from bs4 import BeautifulSoup ## This is BeautifulSoup4
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from blogscraper.items import BlogArticle ## This is for saving data. Probably insignificant.

class BlogArticleSpider(BaseSpider):
    name = "blogscraper"
    allowed_domains = ["cseblog.com"]
    start_urls = [
        "http://www.cseblog.com/",
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=('\d+/\d+/*"', ), deny=( ))),
    )

    def parse(self, response):
        site = BeautifulSoup(response.body_as_unicode())
        items = []
        item = BlogArticle()
        item['title'] = site.find("h3" , {"class": "post-title" } ).text.strip()
        item['link'] = site.find("h3" , {"class": "post-title" } ).a.attrs['href']
        item['text'] = site.find("div" , {"class": "post-body" } )
        items.append(item)
        return items

Where do I specify that it needs to crawl all links of the form http://www.cseblog.com/{d+}/{d+}/{*}.html and http://www.cseblog.com/search/{*} recursively, but save data only from http://www.cseblog.com/{d+}/{d+}/{*}.html?

Upvotes: 3

Views: 413

Answers (1)

Biswanath

Reputation: 9185

You have to create either two rules, or a single rule telling Scrapy to allow URLs of both types. Basically, your rules list should look something like this:

rules = (
    # Article pages: save their data via the callback, and keep following links
    Rule(SgmlLinkExtractor(allow=(r'\d+/\d+/.*\.html',)), callback='parse_save', follow=True),
    # Search pages: crawl them to discover more links
    Rule(SgmlLinkExtractor(allow=(r'search/',)), callback='parse_only', follow=True),
)

BTW, you should be using CrawlSpider, and rename the parse method, unless you want to override that method from the base class.

Since the two link types have different callbacks, you can decide which pages' data to save, rather than using a single callback and checking response.url inside it again.
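
Putting it together, a minimal sketch of the reworked spider might look like the following. The callback name parse_save is hypothetical; BlogArticle and the extraction logic come from the question. If nothing needs to be saved from the search pages, that rule can simply follow links with no callback at all, as done here:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from bs4 import BeautifulSoup

from blogscraper.items import BlogArticle

class BlogArticleSpider(CrawlSpider):
    name = "blogscraper"
    allowed_domains = ["cseblog.com"]
    start_urls = ["http://www.cseblog.com/"]

    rules = (
        # Article pages: extract and save data, and keep following links
        Rule(SgmlLinkExtractor(allow=(r'\d+/\d+/.*\.html',)),
             callback='parse_save', follow=True),
        # Search pages: crawled for links only; with no callback,
        # CrawlSpider follows their links by default
        Rule(SgmlLinkExtractor(allow=(r'search/',))),
    )

    def parse_save(self, response):
        # Same extraction as in the question, moved out of parse()
        # so that CrawlSpider's own parse() is not overridden
        site = BeautifulSoup(response.body_as_unicode())
        item = BlogArticle()
        item['title'] = site.find("h3", {"class": "post-title"}).text.strip()
        item['link'] = site.find("h3", {"class": "post-title"}).a.attrs['href']
        item['text'] = site.find("div", {"class": "post-body"})
        return [item]

Note that follow=True matters on the article rule: when a Rule has a callback, CrawlSpider stops following links from those pages unless follow is set explicitly.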

Upvotes: 1
