Friezan

Reputation: 41

Scrapy RSS Scraper

I am trying to scrape an RSS feed from Yahoo (their Open Company RSS Feed: https://developer.yahoo.com/finance/company.html).

I am trying to scrape the following URL: https://feeds.finance.yahoo.com/rss/2.0/headline?s=BPMX

For some reason my spider isn't working. I think the problem is either the XPath expressions I generated or the way I've defined parse_item.

import scrapy
from scrapy.spiders import CrawlSpider
from YahooScrape.items import YahooScrapeItem

class Spider(CrawlSpider):
    name= "YahooScrape"
    allowed_domains = ["yahoo.com"]
    start_urls = ('https://feeds.finance.yahoo.com/rss/2.0/headline?s=BPMX',)

   def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = EmperyscraperItem()
        item['title'] = response.xpath('//*[@id="collapsible"]/div[1]/div[2]/span',).extract()                #define XPath for title
        item['link'] = response.xpath('//*[@id="collapsible"]/div[1]/div[2]/span',).extract()                 #define XPath for link
        item['description'] = response.xpath('//*[@id="collapsible"]/div[1]/div[2]/span',).extract()          #define XPath for description
        return item

What could be the issue with the code? Failing that, what are the correct XPath expressions to extract the title, description, and link? I'm new to Scrapy and just need some help figuring it out!

Edit: I've updated my spider and converted it into an XMLFeedSpider as shown below:

import scrapy

from scrapy.spiders import XMLFeedSpider
from YahooScrape.items import YahooScrapeItem

class Spider(XMLFeedSpider):
    name = "YahooScrape"
    allowed_domains = ["yahoo.com"]
    start_urls = ('https://feeds.finance.yahoo.com/rss/2.0/headline?s=BPMX')    #Crawl BPMX
    itertag = 'item'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))

        item = YahooScrapeItem()
        item['title'] = node.xpath('item/title/text()',).extract()                #define XPath for title
        item['link'] = node.xpath('item/link/text()').extract()
        item['pubDate'] = node.xpath('item/link/pubDate/text()').extract()
        item['description'] = node.xpath('item/category/text()').extract()                #define XPath for description
        return item

#Yahoo RSS feeds http://finance.yahoo.com/rss/headline?s=BPMX,APPL

Now I'm getting the following error:

2017-06-13 11:25:57 [scrapy.core.engine] ERROR: Error while obtaining start requests

Any idea why this error occurs? My XPath expressions look correct to me.

Upvotes: 2

Views: 3600

Answers (1)

paul trmbrth

Reputation: 20748

From what I can see, CrawlSpider only works for HTML responses. So I suggest that you build upon a simpler scrapy.Spider, or the more specialized XMLFeedSpider.

Then, the XPaths you are using in parse_item seem to have been built from what your browser rendered as HTML from the XML/RSS feed. The feed itself contains no *[@id="collapsible"] element and no <div>s.

Look at view-source:https://feeds.finance.yahoo.com/rss/2.0/headline?s=BPMX instead:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<rss version="2.0">
    <channel>
        <copyright>Copyright (c) 2017 Yahoo! Inc. All rights reserved.</copyright>
        <description>Latest Financial News for BPMX</description>
        <image>
            <height>45</height>
            <link>http://finance.yahoo.com/q/h?s=BPMX</link>
            <title>Yahoo! Finance: BPMX News</title>
            <url>http://l.yimg.com/a/i/brand/purplelogo/uh/us/fin.gif</url>
            <width>144</width>
        </image>
        <item>
            <description>MENLO PARK, Calif., June 7, 2017 /PRNewswire/ -- BioPharmX Corporation (NYSE MKT: BPMX), a specialty pharmaceutical company focusing on dermatology, today announced that it will release its financial results ...</description>
            <guid isPermaLink="false">f56d5bf8-f278-37fd-9aa5-fe04b2e1fa53</guid>
            <link>https://finance.yahoo.com/news/biopharmx-report-first-quarter-financial-101500259.html?.tsrc=rss</link>
            <pubDate>Wed, 07 Jun 2017 10:15:00 +0000</pubDate>
            <title>BioPharmX to Report First Quarter Financial Results</title>
        </item>
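
To check which relative paths match inside an <item>, here is a minimal sketch that uses Python's standard-library xml.etree.ElementTree instead of Scrapy's selectors, so the calls differ from node.xpath(...). The feed is a trimmed copy of the sample above, and the names FEED, extract_items and nested_pubDate are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the feed above: one <item>, description shortened.
FEED = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<rss version="2.0">
    <channel>
        <description>Latest Financial News for BPMX</description>
        <item>
            <description>MENLO PARK, Calif., June 7, 2017 /PRNewswire/ ...</description>
            <link>https://finance.yahoo.com/news/biopharmx-report-first-quarter-financial-101500259.html?.tsrc=rss</link>
            <pubDate>Wed, 07 Jun 2017 10:15:00 +0000</pubDate>
            <title>BioPharmX to Report First Quarter Financial Results</title>
        </item>
    </channel>
</rss>
"""


def extract_items(xml_text):
    """Return one dict per <item>, using paths relative to the item node."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    items = []
    for node in root.iter("item"):
        items.append({
            "title": node.findtext("title"),
            "link": node.findtext("link"),
            "pubDate": node.findtext("pubDate"),
            "description": node.findtext("description"),
            # <pubDate> is a sibling of <link>, not a child of it,
            # so a nested path finds nothing and findtext returns None.
            "nested_pubDate": node.findtext("link/pubDate"),
        })
    return items


if __name__ == "__main__":
    for item in extract_items(FEED):
        print(item)
```

The point of the sketch is that, once you are iterating over <item> nodes, the paths are relative to each item: title, link, pubDate and description are direct children, so a path prefixed with item/ or nested under link matches nothing.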

Working spider example:

import scrapy

from scrapy.spiders import XMLFeedSpider
#from YahooScrape.items import YahooScrapeItem

class Spider(XMLFeedSpider):
    name = "YahooScrape"
    allowed_domains = ["yahoo.com"]
    start_urls = ('https://feeds.finance.yahoo.com/rss/2.0/headline?s=BPMX',)    #Crawl BPMX
    itertag = 'item'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))

        item = {}
        # XPaths are relative to the current <item> node
        item['title'] = node.xpath('title/text()').extract_first()
        item['link'] = node.xpath('link/text()').extract_first()
        # <pubDate> is a sibling of <link>, not a child of it
        item['pubDate'] = node.xpath('pubDate/text()').extract_first()
        item['description'] = node.xpath('description/text()').extract_first()
        return item
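
Note the trailing comma in start_urls above. In Python, parentheses around a single string do not create a tuple, so iterating over the value yields single characters rather than one URL. That is a plausible cause of the "Error while obtaining start requests" message, although the full traceback isn't shown in the question, so treat that as an assumption. A quick plain-Python check, with illustrative variable names:

```python
URL = "https://feeds.finance.yahoo.com/rss/2.0/headline?s=BPMX"

without_comma = (URL)   # still a str; the parentheses do nothing
with_comma = (URL,)     # a one-element tuple

# Iterating over a str yields characters, not URLs.
first_from_str = next(iter(without_comma))
first_from_tuple = next(iter(with_comma))

print(type(without_comma).__name__, repr(first_from_str))
print(type(with_comma).__name__, repr(first_from_tuple))
```

Using a list, for example start_urls = [URL], avoids the pitfall entirely.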

Upvotes: 3
