Reputation: 55
I want to scrape every title tag, along with the other tags inside each parent item tag, but nothing gets scraped. The same XPath expressions work fine when I try them in scrapy shell. Below is my whole code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy_rss.items import rssItem
from scrapy.utils.response import get_base_url
import time

class MySpider(CrawlSpider):
    name = 'rssaggr'
    allowed_domains = ['indianexpress.com']
    start_urls = ['http://indianexpress.com/section/sports/feed/']
    rules = (
        Rule(SgmlLinkExtractor(allow=('', ), deny=('defghi\.txt')), callback='parse_item',follow=True),
    )

    def parse_item(self, response):
        sel = Selector(response)
        items = sel.xpath('//item')
        for elements in items:
            item = rssItem()
            item['title'] = elements.xpath('./title/text()').extract()
            return item
Below is my items.py
from scrapy.item import Item, Field

class ScrapyRssItem(Item):
    # define the fields for your item here like:
    # name = Field()
    pass

class rssItem(Item):
    title = Field()
Upvotes: 1
Reputation: 128
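Your callback should be named parse, not parse_item. By default Scrapy passes the responses for start_urls to the spider's parse method, so the feed itself never reaches a callback with a different name (see the documentation). Be aware that defining parse on a CrawlSpider replaces the method that applies the rules, so the Rule in your spider no longer has any effect; since you only parse the one feed, a plain Spider would do.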
Also, your code returns only the first parsed item, because the return statement is reached on the first pass through the loop. Collect all the items in a list and return that list after the loop instead. I modified your code like this so you get every item from the feed (I tested it and it works):
def parse(self, response):
    sel = Selector(response)
    items = sel.xpath('//item')
    parsed_items = []
    for elements in items:
        item = rssItem()
        item['title'] = elements.xpath('./title/text()').extract()
        parsed_items.append(item)
    return parsed_items
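If it helps to see the difference outside Scrapy, here is a minimal standalone sketch that uses only the standard library. The feed string and the names SAMPLE_FEED, first_title_only, all_titles and iter_titles are made up for illustration; they come neither from the real feed nor from Scrapy's API. It shows why a return inside the loop hands back just the first item, and that a generator (yield) is an alternative to building a list, which Scrapy callbacks accept as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed, standing in for the real RSS response.
SAMPLE_FEED = """<rss><channel>
<item><title>First story</title></item>
<item><title>Second story</title></item>
</channel></rss>"""


def first_title_only(xml_text):
    """Mirrors the bug: the return sits inside the loop."""
    root = ET.fromstring(xml_text)
    for element in root.iter("item"):
        # The function exits here on the first <item>.
        return {"title": element.findtext("title")}


def all_titles(xml_text):
    """Mirrors the fix: collect every item, return after the loop."""
    root = ET.fromstring(xml_text)
    parsed_items = []
    for element in root.iter("item"):
        parsed_items.append({"title": element.findtext("title")})
    return parsed_items


def iter_titles(xml_text):
    """Generator variant: yield each item instead of building a list."""
    root = ET.fromstring(xml_text)
    for element in root.iter("item"):
        yield {"title": element.findtext("title")}


if __name__ == "__main__":
    print(first_title_only(SAMPLE_FEED))
    print(all_titles(SAMPLE_FEED))
    print(list(iter_titles(SAMPLE_FEED)))
```

In a spider, the generator form would amount to replacing parsed_items.append(item) with yield item and dropping the final return.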
Upvotes: 2