Reputation: 1202
I'm writing a Scrapy spider to crawl today's NYT articles from the homepage, but for some reason it doesn't follow any links. When I instantiate the link extractor in scrapy shell http://www.nytimes.com, it successfully extracts a list of article URLs with le.extract_links(response), but I can't get my crawl command (scrapy crawl nyt -o out.json) to scrape anything beyond the homepage. I'm at my wit's end. Is it because the homepage does not yield an article from the parse function? Any help is greatly appreciated.
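Roughly what I ran in the shell to check the extractor (the allow pattern is the same one used in the spider's rule below):

$ scrapy shell http://www.nytimes.com
>>> from datetime import date
>>> from scrapy.contrib.linkextractors import LinkExtractor
>>> today = date.today().strftime('%Y/%m/%d')
>>> le = LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, ))
>>> le.extract_links(response)  # returns a list of Link objects for today's articles

And here is the spider itself: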
from datetime import date

import scrapy
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors import LinkExtractor

from ..items import NewsArticle


with open('urls/debug/nyt.txt') as debug_urls:
    debug_urls = debug_urls.readlines()
with open('urls/release/nyt.txt') as release_urls:
    release_urls = release_urls.readlines()  # ["http://www.nytimes.com"]

today = date.today().strftime('%Y/%m/%d')
print today


class NytSpider(scrapy.Spider):
    name = "nyt"
    allowed_domains = ["nytimes.com"]
    start_urls = release_urls

    rules = (
        Rule(LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, )),
             callback='parse', follow=True),
    )

    def parse(self, response):
        article = NewsArticle()
        for story in response.xpath('//article[@id="story"]'):
            article['url'] = response.url
            article['title'] = story.xpath(
                '//h1[@id="story-heading"]/text()').extract()
            article['author'] = story.xpath(
                '//span[@class="byline-author"]/@data-byline-name').extract()
            article['published'] = story.xpath(
                '//time[@class="dateline"]/@datetime').extract()
            article['content'] = story.xpath(
                '//div[@id="story-body"]/p//text()').extract()
            yield article
Upvotes: 3
Views: 754
Reputation: 1202
I have found the solution to my problem. I was doing two things wrong:

1. I needed to use CrawlSpider rather than Spider if I wanted it to automatically crawl sublinks.
2. When using CrawlSpider, I needed to use a callback function rather than overriding parse. As per the docs, overriding parse breaks CrawlSpider functionality.
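For reference, a minimal sketch of the spider after those two changes. The callback name parse_article is arbitrary (anything other than parse works); the item fields and XPaths are unchanged from the question, and the start URL is hard-coded here instead of being read from a file:

from datetime import date

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

from ..items import NewsArticle

today = date.today().strftime('%Y/%m/%d')


class NytSpider(CrawlSpider):  # subclass CrawlSpider, not scrapy.Spider
    name = "nyt"
    allowed_domains = ["nytimes.com"]
    start_urls = ["http://www.nytimes.com"]  # hard-coded instead of urls/release/nyt.txt

    # Follow today's article links and hand each response to parse_article.
    # Do not name the callback 'parse': CrawlSpider uses parse internally.
    rules = (
        Rule(LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, )),
             callback='parse_article', follow=True),
    )

    def parse_article(self, response):
        for story in response.xpath('//article[@id="story"]'):
            article = NewsArticle()
            article['url'] = response.url
            article['title'] = story.xpath(
                '//h1[@id="story-heading"]/text()').extract()
            article['author'] = story.xpath(
                '//span[@class="byline-author"]/@data-byline-name').extract()
            article['published'] = story.xpath(
                '//time[@class="dateline"]/@datetime').extract()
            article['content'] = story.xpath(
                '//div[@id="story-body"]/p//text()').extract()
            yield article

With that in place, scrapy crawl nyt -o out.json should follow today's article links from the homepage instead of stopping there.
Upvotes: 4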