Silas

Reputation: 89

Scrapy CrawlSpider output

I'm having an issue running through the CrawlSpider example in the Scrapy documentation. It seems to be crawling just fine, but I'm having trouble getting it to output to a CSV file (or anything, really).

So, my question is: can I use this:

scrapy crawl dmoz -o items.csv

or do I have to create an Item Pipeline?

UPDATED, now with code!:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from targets.item import TargetsItem

class MySpider(CrawlSpider):
    name = 'abc'
    allowed_domains = ['ididntuseexample.com']
    start_urls = ['http://www.ididntuseexample.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('ididntuseexample.com', ))),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        item = TargetsItem()
        item['title'] = response.xpath('//h2/a/text()').extract()  # this pulled down data in scrapy shell
        item['link'] = response.xpath('//h2/a/@href').extract()    # this pulled down data in scrapy shell
        return item
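
TargetsItem is just a plain scrapy.Item declaring the two fields used above, something like:

import scrapy

class TargetsItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()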

Upvotes: 0

Views: 3121

Answers (1)

dreyescat

Reputation: 13798

Rules are the mechanism CrawlSpider uses to follow links. Those links are defined with a LinkExtractor, which basically indicates which links to extract from the crawled pages (such as the ones defined in the start_urls list) so they can be followed. You can then pass a callback that will be called on each extracted link, or more precisely, on each page downloaded by following those links.
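
As the comment in your own code hints, the callback also changes the default follow behaviour. Here is a minimal sketch contrasting the two kinds of rules (the URL patterns are the ones from the documentation example, not anything specific to your site):

rules = (
    # No callback: just follow the matched links (follow defaults to True in this case).
    Rule(LinkExtractor(allow=(r'category\.php',))),
    # With a callback: each page downloaded from a matched link is passed to parse_item
    # (and follow defaults to False unless you set it explicitly).
    Rule(LinkExtractor(allow=(r'item\.php',)), callback='parse_item'),
)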

Your rule must call parse_item. So, replace:

Rule(LinkExtractor(allow=('ididntuseexample.com', ))),

with:

Rule(LinkExtractor(allow=('ididntuseexample.com',)), callback='parse_item'),

This rule says you want to call parse_item for every link whose URL matches ididntuseexample.com. I suspect that what you actually want as the link extractor pattern is not the domain, but the links you want to follow/scrape.

Here is a basic example that crawls Hacker News to retrieve the title and the first lines of the first comment for each news item on the main page.

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class HackerNewsItem(scrapy.Item):
    title = scrapy.Field()
    comment = scrapy.Field()

class HackerNewsSpider(CrawlSpider):
    name = 'hackernews'
    allowed_domains = ['news.ycombinator.com']
    start_urls = [
        'https://news.ycombinator.com/'
    ]
    rules = (
        # Follow any item link and call parse_item.
        Rule(LinkExtractor(allow=('item.*', )), callback='parse_item'),
    )

    def parse_item(self, response):
        item = HackerNewsItem()
        # Get the title
        item['title'] = response.xpath('//*[contains(@class, "title")]/a/text()').extract()
        # Get the first words of the first comment
        item['comment'] = response.xpath('(//*[contains(@class, "comment")])[1]/font/text()').extract()
        return item
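
And to come back to your original question: with parse_item returning items like this, you don't need an item pipeline just to get a file out. Scrapy's feed export writes the returned items for you when you pass -o, e.g.:

scrapy crawl hackernews -o items.csv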

Upvotes: 2
