Reputation: 14086
How can I make the following crawler, written with the Scrapy Python library, browse the entire website recursively:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//ul[@class="directory-url"]/li/a/text()').extract()
        for t in titles:
            print "Title: ", t
I've tried this on a single page:
start_urls = [
    "http://www.dmoz.org/Society/Philosophy/Academic_Departments/Africa/"
]
It works well, but it only returns results from the start URL and doesn't follow the links within the domain.
I suppose this must be done manually in Scrapy, but I don't know how.
Upvotes: 1
Views: 1902
Reputation: 20748
Try using a CrawlSpider (see the documentation) with a single Rule() whose LinkExtractor filters only on the domain(s) you want:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/"
    ]

    rules = (
        # Follow every link that stays on dmoz.org and call parse_page on each page
        Rule(
            SgmlLinkExtractor(allow_domains=("dmoz.org",)),
            callback='parse_page', follow=True
        ),
    )

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//ul[@class="directory-url"]/li/a/text()').extract()
        for t in titles:
            print "Title: ", t
The callback must be named something other than parse (see this warning in the docs): CrawlSpider uses the parse method internally to implement its crawling logic, so overriding it breaks the link following.
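As a side note, if you are on Scrapy 1.0 or later, the scrapy.contrib package and SgmlLinkExtractor are deprecated. A minimal sketch of the same spider using the newer module paths (assuming Python 3; the XPath and the rule are unchanged):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = ["http://www.dmoz.org/"]

    rules = (
        # LinkExtractor replaces the deprecated SgmlLinkExtractor
        Rule(
            LinkExtractor(allow_domains=("dmoz.org",)),
            callback='parse_page', follow=True
        ),
    )

    def parse_page(self, response):
        # response.xpath() replaces the old HtmlXPathSelector
        titles = response.xpath('//ul[@class="directory-url"]/li/a/text()').extract()
        for t in titles:
            print("Title:", t)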
Upvotes: 2