user3136348

Reputation: 55

unable to scrape through scrapy while scraping rss feed

I want to scrape all the title tags, along with the other tags, inside each parent item tag, but I am unable to. I tried scrapy shell and it seems to work fine there. Below is my whole code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy_rss.items import rssItem
from scrapy.utils.response import get_base_url
import time

class MySpider(CrawlSpider):
    name = 'rssaggr'
    allowed_domains = ['indianexpress.com']
    start_urls = ['http://indianexpress.com/section/sports/feed/']
    rules = (
        Rule(SgmlLinkExtractor(allow=('', ), deny=('defghi\.txt')), callback='parse_item',follow=True),
    )
    def parse_item(self, response):
        sel = Selector(response)
        items = sel.xpath('//item')
        for elements in items:
            item = rssItem()
            item['title'] = elements.xpath('./title/text()').extract()
            return item

Below is my items.py:

from scrapy.item import Item, Field

class ScrapyRssItem(Item):
    # define the fields for your item here like:
    # name = Field()
    pass

class rssItem(Item):
    title = Field()

Upvotes: 1

Views: 319

Answers (1)

AniversarioPeru

Reputation: 128

Your function should be named parse, not parse_item. Scrapy expects you to override the spider's parse method, so you should not use a different name (see the documentation).

Also, your code returns only the first parsed item, because the return statement sits inside the for loop. You can append all the items to a list and then return that. I modified your code like this so you get every item from the feed (I tested it and it works):

def parse(self, response):
    sel = Selector(response)
    items = sel.xpath('//item')
    parsed_items = []
    for elements in items:
        item = rssItem()
        item['title'] = elements.xpath('./title/text()').extract()
        parsed_items.append(item)
    return parsed_items
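
As a side note, the extraction logic itself (collect every //item/title from the feed) can be checked outside Scrapy with just the standard library. This is only a sketch of the same idea; the inline FEED string below is a made-up stand-in for the real RSS response:

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal RSS fragment standing in for the real feed.
FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Sports</title>
    <item><title>First headline</title></item>
    <item><title>Second headline</title></item>
  </channel>
</rss>"""

def parse_titles(xml_text):
    """Return the <title> text of every <item>, mirroring //item/title/text()."""
    root = ET.fromstring(xml_text)
    # iter('item') walks the whole tree, like the //item XPath in the spider.
    return [item.findtext('title') for item in root.iter('item')]

print(parse_titles(FEED))  # ['First headline', 'Second headline']
```

If the titles come out right here but not in the spider, the problem is in how the callback is wired up (as above), not in the XPath.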

Upvotes: 2
