Reputation: 813
I'm trying to build a narrowly targeted web crawler with Scrapy that returns an object of my results. I'm getting stuck and am probably going about things totally backwards.
More specifically, for each of the subforums at TheScienceForum.com (math, physics, etc.), I would like to get the titles of all the threads within that subforum, ending up with an object that holds the forum's name and a list of all its thread titles.
The end goal is to do text analysis on the thread titles to determine the most common terms/jargon associated with each forum. Eventually I would like to do analysis of the threads themselves as well.
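(For context, the kind of analysis I have in mind is simple term counting over each item's titles, roughly like this sketch; the helper name and the n parameter are my own inventions:)

    from collections import Counter

    def most_common_terms(item, n=20):
        # count the most frequent lowercased words across one forum's thread titles
        words = (word.lower() for title in item['titles'] for word in title.split())
        return Counter(words).most_common(n)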
I have one class Item defined as the following:
from scrapy.item import Item, Field

class ProjectItem(Item):
    name = Field()    # the forum name
    titles = Field()  # the thread titles
I may be misunderstanding how items work, but I would like to end up with one item for each subforum with all the thread titles from that subforum in a list in the same item.
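For concreteness, one finished item should end up looking roughly like this (forum name and titles are invented placeholders):

    from individualProject.items import ProjectItem

    item = ProjectItem()
    item['name'] = 'Mathematics'
    item['titles'] = ['Thread title one', 'Thread title two']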
The crawler I wrote looks like this but does not function as expected:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from individualProject.items import ProjectItem

class TheScienceForum(CrawlSpider):
    name = "TheScienceForum.com"
    allowed_domains = ["theScienceForum.com"]
    start_urls = ["http://www.thescienceforum.com"]
    rules = [
        Rule(SgmlLinkExtractor(restrict_xpaths=['//h2[@class="forumtitle"]/a']), 'parse_one'),
        Rule(SgmlLinkExtractor(restrict_xpaths=['//div[@class="threadpagenav"]']), 'parse_two'),
    ]

    def parse_one(self, response):
        Sel = HtmlXPathSelector(response)
        forumNames = Sel.select('//h2[@class="forumtitle"]/a/text()').extract()
        items = []
        for forumName in forumNames:
            item = ProjectItem()
            item['name'] = forumName
            items.append(item)
        yield items

    def parse_two(self, response):
        Sel = HtmlXPathSelector(response)
        threadNames = Sel.select('//h3[@class="threadtitle"]/a/text()').extract()
        for item in items:
            for title in titles:
                if Sel.select('//h1/span[@class="forumtitle"]/text()').extract() == item.name:
                    item['titles'] += Sel.select('//h3[@class="threadtitle"]/a/text()').extract()
        return items
The idea is to start on the main page of the site, where all the subforum names are. The first rule only allows links to the subforums' first pages, and the parse function associated with it is meant to create an item for each subforum, filling in the forum name for the 'name' field.
For the requests that follow, using the second rule, the spider is limited to navigating the pages containing all the threads of a subforum (the paginated links). The second parse method is meant to add the thread titles to the item (created in the first parse method) that corresponds to the name of the current subforum (Sel.select('//h1/span[@class="forumtitle"]/text()').extract()).
The spider is crawling to all the main forum pages, but for each one I am getting the following error:
2013-11-01 13:05:37-0400 [TheScienceForum.com] ERROR: Spider must return Request, BaseItem or None, got 'list' in <GET http://www.thescienceforum.com/mathematics/>
Any help or advice would be greatly appreciated. Thanks!
Upvotes: 1
Views: 3468
Reputation: 813
I found a solution to the crawling problem I was running into. The following code starts a spider on the forum homepage, creating a new item for each subforum. The spider then follows the links, going to each page of the subforum and collecting the thread titles along the way (adding them to the relevant item, all of which are passed along in the meta of each subsequent request). The code is as follows:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from individualProject.items import ProjectItem

class TheScienceForum(BaseSpider):
    name = "TheScienceForum.com"
    allowed_domains = ["www.thescienceforum.com"]
    start_urls = ["http://www.thescienceforum.com"]

    def parse(self, response):
        Sel = HtmlXPathSelector(response)
        forumNames = Sel.select('//h2[@class="forumtitle"]/a/text()').extract()
        items = []
        for forumName in forumNames:
            item = ProjectItem()
            item['name'] = forumName
            item['titles'] = []  # initialize so titles can be appended to later
            items.append(item)
        forums = Sel.select('//h2[@class="forumtitle"]/a/@href').extract()
        for forum in forums:
            yield Request(url=forum, meta={'items': items}, callback=self.addThreadNames)

    def addThreadNames(self, response):
        items = response.meta['items']
        Sel = HtmlXPathSelector(response)
        # the subforum name shown in the page header
        currentForum = Sel.select('//h1/span[@class="forumtitle"]/text()').extract()[0]
        for item in items:
            if currentForum == item['name']:
                item['titles'] += Sel.select('//h3[@class="threadtitle"]/a/text()').extract()
        self.log(str(items))
        # follow the "next page" links of the current subforum, carrying the items along
        threadPageNavs = Sel.select('//span[@class="prev_next"]/a[@rel="next"]/@href').extract()
        for threadPageNav in threadPageNavs:
            yield Request(url=threadPageNav, meta={'items': items}, callback=self.addThreadNames)
The issue I'm running into now is how to save the data the spider is collecting (so it can later be analyzed). I opened another question here in that regard.
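For what it's worth, one direction I'm considering (a sketch under my own assumptions, not the outcome of that question): yield each finished item once its subforum has no further "next" link, so that Scrapy's built-in feed exports can pick the items up.

    # inside addThreadNames, right after extracting threadPageNavs (sketch):
    if not threadPageNavs:
        # no "next" link left, so this subforum is complete: emit its item
        for item in items:
            if item['name'] == currentForum:
                yield item

With the items actually yielded, they can be written out with the stock exporter, e.g. scrapy crawl TheScienceForum.com -o threadTitles.json -t json.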
Upvotes: 2
Reputation: 9185
As suggested by Christian Temus, be more descriptive about the problems you are facing. Looking at the code, I can make some suggestions:
Rather than returning a list of items, you should do "yield item" inside the for loop (see the sketch below).
Use a CrawlSpider.
If you use a CrawlSpider, rename the 'parse' method to something else, like parse_titles.
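For example, applied to parse_one from the question (a sketch; the selector is copied from the question, and this would replace the method inside the spider class):

    def parse_one(self, response):
        Sel = HtmlXPathSelector(response)
        forumNames = Sel.select('//h2[@class="forumtitle"]/a/text()').extract()
        for forumName in forumNames:
            item = ProjectItem()
            item['name'] = forumName
            yield item  # yield each item on its own instead of returning a list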
Upvotes: 0