Reputation: 47
I'm trying to get my web crawler to follow links extracted from a webpage. I'm using Scrapy. I can successfully pull data with my crawler, but can't get it to crawl. I believe the problem is in my rules section. I'm new to Scrapy. Thanks for your help in advance.
I'm scraping this website:
http://ballotpedia.org/wiki/index.php/Category:2012_challenger
The links I'm trying to follow look like this in the source code:
/wiki/index.php/A._Ghani
or
/wiki/index.php/A._Keith_Carreiro
Here is the code for my spider:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from ballot1.items import Ballot1Item

class Ballot1Spider(CrawlSpider):
    name = "stewie"
    allowed_domains = ["ballotpedia.org"]
    start_urls = [
        "http://ballotpedia.org/wiki/index.php/Category:2012_challenger"
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow=r'w+'), follow=True),
        Rule(SgmlLinkExtractor(allow=r'\w{4}/\w+/\w+'), callback='parse')
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('*')
        items = []
        for site in sites:
            item = Ballot1Item()
            item['candidate'] = site.select('/html/head/title/text()').extract()
            item['position'] = site.select('//table[@class="infobox"]/tr/td/b/text()').extract()
            item['controversies'] = site.select('//h3/span[@id="Controversies"]/text()').extract()
            item['endorsements'] = site.select('//h3/span[@id="Endorsements"]/text()').extract()
            item['currentposition'] = site.select('//table[@class="infobox"]/tr/td[@style="text-align:center; background-color:red;color:white; font-size:100%; font-weight:bold;"]/text()').extract()
            items.append(item)
        return items
Upvotes: 0
Views: 489
Reputation: 7889
You're using a CrawlSpider with a callback of parse, which the scrapy documentation expressly warns will prevent crawling. Rename it to something like parse_items and you should be fine.
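To illustrate why the name matters (a plain-Python analogy, not Scrapy's actual internals): CrawlSpider implements its own link-following logic in a method named parse, so a subclass that overrides parse silently replaces that logic.

```python
# Analogy only: a base class whose machinery lives in a method named
# parse(), like Scrapy's CrawlSpider.
class BaseCrawler:
    def run(self, pages):
        # The framework calls self.parse() to drive crawling.
        return self.parse(pages)

    def parse(self, pages):
        return ["followed:" + p for p in pages]

class BadSpider(BaseCrawler):
    def parse(self, pages):          # shadows the crawling logic
        return ["item from " + p for p in pages]

class GoodSpider(BaseCrawler):
    def parse_items(self, pages):    # distinct name: base logic intact
        return ["item from " + p for p in pages]

print(BadSpider().run(["a", "b"]))   # ['item from a', 'item from b']
print(GoodSpider().run(["a", "b"]))  # ['followed:a', 'followed:b']
```

BadSpider never follows any links because its parse replaced the base class behavior, which is exactly what happens when a CrawlSpider rule uses callback='parse'.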
Upvotes: 1
Reputation: 298176
The links that you're after are only present in this element:
<div lang="en" dir="ltr" class="mw-content-ltr">
So you have to restrict the XPath to prevent extraneous links:
restrict_xpaths='//div[@id="mw-pages"]/div'
Finally, you only want to follow links that look like /wiki/index.php?title=Category:2012_challenger&pagefrom=Alison+McCoy#mw-pages, so your final rules should look like:
rules = (
    Rule(
        SgmlLinkExtractor(
            allow=r'&pagefrom='
        ),
        follow=True
    ),
    Rule(
        SgmlLinkExtractor(
            restrict_xpaths='//div[@id="mw-pages"]/div'
        ),
        callback='parse'
    )
)
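As a rough sketch of what restrict_xpaths does, here is the same idea using the stdlib ElementTree on a toy snippet (this is an illustration, not Scrapy's actual extractor; the sidebar div is made up):

```python
import xml.etree.ElementTree as ET

html = """
<body>
  <div id="sidebar"><a href="/wiki/index.php/Unrelated">x</a></div>
  <div id="mw-pages">
    <div><a href="/wiki/index.php/A._Ghani">A. Ghani</a></div>
  </div>
</body>
"""
root = ET.fromstring(html)

# restrict_xpaths narrows link extraction to one subtree, so links
# elsewhere on the page (sidebar, navigation, footer) are ignored.
region = root.find(".//div[@id='mw-pages']/div")
links = [a.get('href') for a in region.iter('a')]
print(links)  # ['/wiki/index.php/A._Ghani']
```

Only the candidate link inside the mw-pages div survives; the sidebar link is never seen.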
Upvotes: 1
Reputation: 25693
r'w+' is wrong (I think you meant r'\w+'), and r'\w{4}/\w+/\w+' doesn't look right either, as it doesn't match your links (it's missing a leading /). Why don't you try just r'/wiki/index.php/.+'? Don't forget that \w doesn't include . or other symbols that can be part of an article name.
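A quick check with the stdlib re module (using the sample paths from the question) shows the difference:

```python
import re

links = ["/wiki/index.php/A._Ghani", "/wiki/index.php/A._Keith_Carreiro"]

# The question's pattern: \w never matches '.', so it fails on
# article names like "A._Ghani".
original = re.compile(r'\w{4}/\w+/\w+')
# The suggested pattern matches the whole path, dots included.
suggested = re.compile(r'/wiki/index\.php/.+')

print([bool(original.search(l)) for l in links])   # [False, False]
print([bool(suggested.search(l)) for l in links])  # [True, True]
```

The original pattern dies at the . in index.php and in the candidate names, so it never matches either link.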
Upvotes: 0