Reputation: 684
I don't know where the issue lies; it's probably super easy to fix since I am new to Scrapy. Thanks for your help!
My Spider:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Item

class ArticleSpider(CrawlSpider):
    name = "article"
    allowed_domains = ["economist.com"]
    start_urls = ['http://www.economist.com/sections/science-technology']

    rules = [
        Rule(LinkExtractor(restrict_xpaths='//article'), callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        for sel in response.xpath('//div/article'):
            item = scrapy.Item()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            return item
Items:
import scrapy

class EconomistItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
Part of Log:
INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
Crawled (200) <GET http://www.economist.com/sections/science-technology> (referer: None)
Edit:
After I added the changes proposed by alecxe, another problem occurred:
Log:
[scrapy] DEBUG: Crawled (200) <GET http://www.economist.com/news/science-and-technology/21688848-stem-cells-are-starting-prove-their-value-medical-treatments-curing-multiple> (referer: http://www.economist.com/sections/science-technology)
2016-02-04 14:05:01 [scrapy] DEBUG: Crawled (200) <GET http://www.economist.com/news/science-and-technology/21689501-beating-go-champion-machine-learning-computer-says-go> (referer: http://www.economist.com/sections/science-technology)
2016-02-04 14:05:02 [scrapy] ERROR: Spider error processing <GET http://www.economist.com/news/science-and-technology/21688848-stem-cells-are-starting-prove-their-value-medical-treatments-curing-multiple> (referer: http://www.economist.com/sections/science-technology)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
    for x in result:
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/crawl.py", line 67, in _parse_response
    cb_res = callback(response, **cb_kwargs) or ()
  File "/Users/FvH/Desktop/Python/projects/economist/economist/spiders/article.py", line 18, in parse_item
    item = scrapy.Item()
NameError: global name 'scrapy' is not defined
Settings:
BOT_NAME = 'economist'
SPIDER_MODULES = ['economist.spiders']
NEWSPIDER_MODULE = 'economist.spiders'
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36"
And if I export the data to a CSV file, it is obviously just empty.
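For reference, I export via Scrapy's built-in feed exporter, i.e. something like:

    scrapy crawl article -o items.csv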
Thanks
Upvotes: 1
Views: 1090
Reputation: 855
You imported only Item (not the whole scrapy module):
from scrapy.item import Item
So instead of using scrapy.Item here:
for sel in response.xpath('//div/article'):
    item = scrapy.Item()
    item['title'] = sel.xpath('a/text()').extract()
You should use just Item:
for sel in response.xpath('//div/article'):
    item = Item()
    item['title'] = sel.xpath('a/text()').extract()
Or, better, import your own item and use it (a bare Item has no declared fields, so assigning to item['title'] would raise a KeyError anyway). This should work (don't forget to replace project_name with the name of your project):
from project_name.items import EconomistItem
...
for sel in response.xpath('//div/article'):
    item = EconomistItem()
    item['title'] = sel.xpath('a/text()').extract()
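For completeness, a minimal sketch of the fixed callback (assuming the project is named economist, as BOT_NAME in your settings suggests); note that yielding instead of returning inside the loop emits one item per matched article rather than stopping at the first:

from economist.items import EconomistItem  # 'economist' assumed from BOT_NAME

class ArticleSpider(CrawlSpider):
    ...
    def parse_item(self, response):
        for sel in response.xpath('//div/article'):
            item = EconomistItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            # yield each populated item; a return here would end the
            # loop after the first article
            yield item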
Upvotes: 0
Reputation: 473763
parse_item is not correctly indented; it should be:
class ArticleSpider(CrawlSpider):
    name = "article"
    allowed_domains = ["economist.com"]
    start_urls = ['http://www.economist.com/sections/science-technology']

    rules = [
        Rule(LinkExtractor(allow=r'Items'), callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        for sel in response.xpath('//div/article'):
            item = scrapy.Item()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            return item
Two things to fix aside from that:
The link extracting part should be fixed to match the article links:

Rule(LinkExtractor(restrict_xpaths='//article'), callback='parse_item', follow=True),

You need to specify the USER_AGENT setting to pretend to be a real browser; otherwise, the response would not contain the list of articles:

USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36"
Upvotes: 2