Reputation: 1345
I am using scrapy to crawl my website http://www.cseblog.com
My spider is as follows:
from scrapy.spider import BaseSpider
from bs4 import BeautifulSoup ## This is BeautifulSoup4
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from blogscraper.items import BlogArticle ## This is for saving data. Probably insignificant.

class BlogArticleSpider(BaseSpider):
    name = "blogscraper"
    allowed_domains = ["cseblog.com"]
    start_urls = [
        "http://www.cseblog.com/",
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow=('\d+/\d+/.*',), deny=())),
    )

    def parse(self, response):
        site = BeautifulSoup(response.body_as_unicode())
        items = []
        item = BlogArticle()
        item['title'] = site.find("h3", {"class": "post-title"}).text.strip()
        item['link'] = site.find("h3", {"class": "post-title"}).a.attrs['href']
        item['text'] = site.find("div", {"class": "post-body"})
        items.append(item)
        return items
Where do I specify that it needs to recursively crawl all links of the type http://www.cseblog.com/{d+}/{d+}/{*}.html and http://www.cseblog.com/search/{*},
but save data only from http://www.cseblog.com/{d+}/{d+}/{*}.html?
Upvotes: 3
Views: 413
Reputation: 9185
You have to create two rules (or one) telling Scrapy to allow URLs of those types. Basically, you want the rules list to be something like this:
rules = (
    Rule(SgmlLinkExtractor(allow=(r'http://www\.cseblog\.com/\d+/\d+/.*\.html',), deny=()), callback='parse_save'),
    Rule(SgmlLinkExtractor(allow=(r'http://www\.cseblog\.com/search/.*',), deny=()), callback='parse_only'),
)
BTW, you should be using CrawlSpider, and rename the parse method unless you want to override that method from the base class.
Both link types get different callbacks, so in effect you can decide which pages' data you want to save, rather than having a single callback and checking response.url again.
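Here is a minimal sketch of how the whole spider could look, assuming the BlogArticle item from the question. The callback names parse_save and parse_only are just the illustrative names used above, and follow=True is added explicitly because a rule that has a callback does not follow links by default:

from bs4 import BeautifulSoup
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from blogscraper.items import BlogArticle  # item class from the question


class BlogArticleSpider(CrawlSpider):
    name = "blogscraper"
    allowed_domains = ["cseblog.com"]
    start_urls = ["http://www.cseblog.com/"]

    rules = (
        # Article pages: crawl them, keep following links, and save their data.
        Rule(SgmlLinkExtractor(allow=(r'http://www\.cseblog\.com/\d+/\d+/.*\.html',)),
             callback='parse_save', follow=True),
        # Search/archive pages: only used to discover more links, nothing is saved.
        Rule(SgmlLinkExtractor(allow=(r'http://www\.cseblog\.com/search/.*',)),
             callback='parse_only', follow=True),
    )

    def parse_save(self, response):
        # Same extraction as in the question, just under a different callback name.
        soup = BeautifulSoup(response.body_as_unicode())
        item = BlogArticle()
        item['title'] = soup.find("h3", {"class": "post-title"}).text.strip()
        item['link'] = soup.find("h3", {"class": "post-title"}).a.attrs['href']
        item['text'] = soup.find("div", {"class": "post-body"})
        yield item

    def parse_only(self, response):
        # Nothing to save from search pages; the rule's follow=True does the crawling.
        pass

If you do not need to do anything with the search pages, you can also drop parse_only and the callback on that rule entirely; a rule without a callback follows links by default.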
Upvotes: 1