Reputation: 71
I am new to python scrapy and trying to get through a small example, however I am having some problems! I am able to crawl the first given URL only, but I am unable to crawl more than one page or an entire website for that matter!
Please help me or give me some advice on how I can crawl an entire website or more pages in general...
The example I am doing is very simple... My items.py
import scrapy
class WikiItem(scrapy.Item):
title = scrapy.Field()
my wikip.py (the spider)
import scrapy
from wiki.items import WikiItem
class CrawlSpider(scrapy.Spider):
name = "wikip"
allowed_domains = ["en.wikipedia.org/wiki/"]
start_urls = (
'http://en.wikipedia.org/wiki/Portal:Arts',
)
def parse(self, response):
for sel in response.xpath('/html'):
item = WikiItem()
item['title'] = sel.xpath('//h1[@id="firstHeading"]/text()').extract()
yield item
When I run scrapy crawl wikip -o data.csv in the root project diretory the result is:
title
Portal:Arts
Can anyone give me insight as to why it is not following urls and crawling deeper?
I have checked some related SO questions but they have not helped to solve the issue
Upvotes: 0
Views: 565
Reputation: 29
scrapy.Spider is the simplest spider. Change the name CrawlSpider, since Crawl Spider is one of the generic spiders of scrapy.
One of the below option can be used:
eg: 1. class WikiSpider(scrapy.Spider)
or 2. class WikiSpider(CrawlSpider)
If you are using first option you need to code the logic for following the links you need to follow on that webpage.
For second option you can do the below:
After the start urls you need to define the rule as below:
rules = (
Rule(LinkExtractor(allow=('https://en.wikipedia.org/wiki/Portal:Arts\?.*?')), callback='parse_item', follow=True,),
)
Also please change the name of the function defined as "parse" if you use CrawlSpider. The Crawl Spider uses parse method to implement the logic. Thus, here you are trying to override the parse method and hence the crawl spider doesn't work.
Upvotes: 1