Axel Magnuson

Reputation: 1202

Scrapy Spider not Following Links

I'm writing a scrapy spider to crawl today's NYT articles from the homepage, but for some reason it doesn't follow any links. When I instantiate the link extractor in scrapy shell http://www.nytimes.com, it successfully extracts a list of article URLs via le.extract_links(response) (roughly the session sketched below), but I can't get my crawl command (scrapy crawl nyt -o out.json) to scrape anything but the homepage. I'm at my wit's end. Is it because the homepage does not yield an article from the parse function? Any help is greatly appreciated.
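For context, the working shell session looks roughly like this (the date literal is just an example of what the spider's today variable expands to):

scrapy shell http://www.nytimes.com
>>> from scrapy.contrib.linkextractors import LinkExtractor
>>> today = '2015/01/01'  # example value; the spider builds this with date.today()
>>> le = LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, ))
>>> le.extract_links(response)  # returns a list of Link objects for article pages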

from datetime import date

import scrapy
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors import LinkExtractor


from ..items import NewsArticle

with open('urls/debug/nyt.txt') as debug_urls:
    debug_urls = debug_urls.readlines()

with open('urls/release/nyt.txt') as release_urls:
    release_urls = release_urls.readlines() # ["http://www.nytimes.com"]

today = date.today().strftime('%Y/%m/%d')
print today


class NytSpider(scrapy.Spider):
    name = "nyt"
    allowed_domains = ["nytimes.com"]
    start_urls = release_urls
    rules = (
            Rule(LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, )),
                 callback='parse', follow=True),
    )

    def parse(self, response):
        article = NewsArticle()
        for story in response.xpath('//article[@id="story"]'):
            article['url'] = response.url
            article['title'] = story.xpath(
                    '//h1[@id="story-heading"]/text()').extract()
            article['author'] = story.xpath(
                    '//span[@class="byline-author"]/@data-byline-name'
            ).extract()
            article['published'] = story.xpath(
                    '//time[@class="dateline"]/@datetime').extract()
            article['content'] = story.xpath(
                    '//div[@id="story-body"]/p//text()').extract()
            yield article

Upvotes: 3

Views: 754

Answers (1)

Axel Magnuson

Reputation: 1202

I found the solution to my problem. I was doing two things wrong:

  1. I needed to subclass CrawlSpider rather than scrapy.Spider if I wanted it to follow links automatically.
  2. When using CrawlSpider, I needed to point the Rule at a callback with a different name rather than overriding parse. As the docs warn, CrawlSpider uses the parse method internally to implement its rule logic, so overriding it breaks link following (a corrected sketch follows this list).
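
For reference, here is a minimal sketch of the corrected spider under those two changes. The item class and XPaths are carried over from the question unchanged, and the URL-file loading is dropped for brevity; I also moved the NewsArticle() construction inside the loop so each story yields a fresh item rather than mutating one shared object.

from datetime import date

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

from ..items import NewsArticle

today = date.today().strftime('%Y/%m/%d')


class NytSpider(CrawlSpider):  # CrawlSpider, not scrapy.Spider, so rules are applied
    name = "nyt"
    allowed_domains = ["nytimes.com"]
    start_urls = ["http://www.nytimes.com"]
    rules = (
        # follow=True keeps extracting links from the matched pages as well
        Rule(LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, )),
             callback='parse_article', follow=True),
    )

    # renamed from parse: CrawlSpider implements parse itself to drive the rules
    def parse_article(self, response):
        for story in response.xpath('//article[@id="story"]'):
            article = NewsArticle()  # fresh item per story
            article['url'] = response.url
            article['title'] = story.xpath(
                '//h1[@id="story-heading"]/text()').extract()
            article['author'] = story.xpath(
                '//span[@class="byline-author"]/@data-byline-name').extract()
            article['published'] = story.xpath(
                '//time[@class="dateline"]/@datetime').extract()
            article['content'] = story.xpath(
                '//div[@id="story-body"]/p//text()').extract()
            yield article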

Upvotes: 4
