gattoun

Reputation: 106

Scraping CSVs with Scrapy

I'm trying to scrape all the CSVs from this site: transparentnevada.com

When you navigate to a specific agency, e.g. http://transparentnevada.com/salaries/2016/university-nevada-reno/ , and hit Download Records, there are links to a number of CSVs. I'd like to download all of them.

My spider runs and appears to crawl all the records but isn't downloading anything:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request


class Spider2(CrawlSpider):
    # name of the spider
    name = 'nevada'

    # list of allowed domains
    allowed_domains = ['transparentnevada.com']

    # starting url for scraping
    start_urls = ['http://transparentnevada.com/salaries/all/']
    rules = [
        Rule(LinkExtractor(
            allow=['/salaries/all/*']),
            follow=True),
        Rule(LinkExtractor(
            allow=['/salaries/2016/*/']),
            follow=True),
        Rule(LinkExtractor(
            allow=['/salaries/2016/*/#']),
            callback='parse_article',
            follow=True),
    ]

    # setting the location of the output csv file
    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'tmp/nevada2.csv'
    }

    def parse_article(self, response):
        for href in response.css('div.view-downloads a[href$=".csv"]::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.save_pdf
            )

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        self.logger.info('Saving CSV %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)

Upvotes: 1

Views: 105

Answers (1)

Tarun Lalwani

Reputation: 146610

The issue is that the CSVs are served from /export/, and none of your rules match those URLs.

I added a simple LinkExtractor rule to your spider and it started downloading the files:

Rule(LinkExtractor(
    # raw string so that \. reaches the regex engine as an escaped dot
    allow=[r'/export/.*\.csv']),
    callback='save_pdf',
    follow=True),

Also, your existing rules are not 100% correct: you have used "/*" where it should be "/.*/".

The allow patterns are regular expressions, not shell globs, so "/*" means "zero or more slashes" (it matches nothing at all, "/", or "////"), not "any path segment". So fix your rules, add the rule I gave, and that should get the work done.
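To see the difference between the two patterns, here is a small sketch using only Python's re module. It assumes the allow patterns are searched anywhere in the URL (hence re.search), and the export URL at the end is a made-up example, not a real path from the site.

```python
import re

# Pattern as written in the question vs. the suggested fix.
loose = re.compile(r'/salaries/2016/*/')
fixed = re.compile(r'/salaries/2016/.*/')
# Pattern from the added rule.
export = re.compile(r'/export/.*\.csv')

agency_url = 'http://transparentnevada.com/salaries/2016/university-nevada-reno/'
year_url = 'http://transparentnevada.com/salaries/2016/'
# Hypothetical export URL, for illustration only.
export_url = 'http://transparentnevada.com/export/2016/example.csv'

# "/*" only consumes slashes, so the match stops right after "2016/".
loose_match = loose.search(agency_url).group()

# "/.*/" consumes the agency segment as well.
fixed_match = fixed.search(agency_url).group()

# The fixed pattern needs something after "2016/", so the bare year page
# is not matched.
fixed_on_year = fixed.search(year_url)

# The loose pattern is happy with a run of slashes and nothing else.
loose_on_slashes = loose.fullmatch('/salaries/2016////')

print(loose_match)
print(fixed_match)
print(fixed_on_year)
print(bool(loose_on_slashes))
print(bool(export.search(export_url)))
```

Note that the loose pattern still finds a match inside the agency URL, but only because any URL containing "/salaries/2016/" satisfies it; it does not express "an agency page under 2016".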

Upvotes: 2
