Reputation: 97
Update
This is embarrassing, but it turned out that the problem with my original pipeline was that I'd forgotten to activate it in my settings. eLRuLL was right anyway, though.
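For anyone who lands here with the same symptom: a custom pipeline only runs once it's listed under ITEM_PIPELINES in settings.py. A minimal sketch of the activation step I'd missed, assuming a project module named ngamedallions:

# settings.py (module path assumed from my project layout)
ITEM_PIPELINES = {
    'ngamedallions.pipelines.NgamedallionsPipeline': 300,
}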
I'm at the stage where I have a working spider that can consistently retrieve the information I'm interested in and push it out in the format I want. My (hopefully) final stumbling block is applying a more reasonable naming convention to the files saved by my images pipeline. The SHA1 hash works, but I find it really unpleasant to work with.
I'm having trouble interpreting the documentation to figure out how to change the naming system, and I didn't have any luck blindly applying this solution. In the course of my scrape, I'm already pulling down a unique identifier for each page; I'd like to use it to name the images, since there's only one per page.
The image pipeline also doesn't seem to respect the fields_to_export section of my pipeline. I'd like to suppress the image URLs to give myself a cleaner, more readable output. If anyone has an idea how to do that, I'd be very grateful.
The unique identifier I'd like to pull out of my parse is CatalogRecord.add_xpath('accession', './/dd[@class="accession"]/text()'). You'll find my spider and my pipelines below.
Spider:
import re

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Identity, Join, TakeFirst

from ngamedallions.items import NgamedallionsItem  # adjust to your project's items module

URL = "http://www.nga.gov/content/ngaweb/Collection/art-object-page.%d"
starting_number = 1315
number_of_pages = 1311

class NGASpider(CrawlSpider):
    name = 'ngamedallions'
    allowed_domains = ['nga.gov']
    # count down from starting_number; range() excludes the lower bound
    start_urls = [URL % i + '.html' for i in range(starting_number, number_of_pages, -1)]
    rules = (
        Rule(LinkExtractor(allow=('art-object-page.*', 'objects/*')),
             callback='parse_CatalogRecord', follow=True),
    )

    def parse_CatalogRecord(self, response):
        CatalogRecord = ItemLoader(item=NgamedallionsItem(), response=response)
        CatalogRecord.default_output_processor = TakeFirst()
        CatalogRecord.image_urls_out = Identity()
        keywords = "reverse|obverse and (medal|medallion)"
        notkey = "Image Not Available"
        n = re.compile('.*(%s).*' % notkey, re.IGNORECASE | re.MULTILINE | re.UNICODE)
        r = re.compile('.*(%s).*' % keywords, re.IGNORECASE | re.MULTILINE | re.UNICODE)
        # skip placeholder pages, then only keep pages matching the keywords
        if not n.search(response.body_as_unicode()):
            if r.search(response.body_as_unicode()):
                CatalogRecord.add_xpath('title', './/dl[@class="artwork-details"]/dt[@class="title"]/text()')
                CatalogRecord.add_xpath('accession', './/dd[@class="accession"]/text()')
                CatalogRecord.add_xpath('inscription', './/div[@id="inscription"]/p/text()', Join(), re='[A-Z]+')
                CatalogRecord.add_xpath('image_urls', './/img[@class="mainImg"]/@src')
                CatalogRecord.add_xpath('date', './/dt[@class="title"]', re=r'(\d+-\d+)')
                return CatalogRecord.load_item()
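For completeness, the item definition isn't shown above; a minimal sketch of what NgamedallionsItem presumably looks like, with fields inferred from the add_xpath() calls (image_urls and images are the standard field pair Scrapy's images pipeline expects):

import scrapy

class NgamedallionsItem(scrapy.Item):
    # fields inferred from the add_xpath() calls in the spider
    title = scrapy.Field()
    accession = scrapy.Field()
    inscription = scrapy.Field()
    date = scrapy.Field()
    # standard fields consumed/populated by Scrapy's images pipeline
    image_urls = scrapy.Field()
    images = scrapy.Field()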
Pipelines:
from scrapy import signals
from scrapy.exporters import CsvItemExporter

class NgamedallionsPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_items.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.fields_to_export = ['accession', 'title', 'date', 'inscription']
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
Upvotes: 2
Views: 2294
Reputation: 20748
Regarding renaming the images written to disk, here's one way to do it:

- pass the name you want in the meta for the images Request generated by the pipeline, by overriding get_media_requests()
- override file_path() and use that info from meta

Example custom ImagesPipeline:
import scrapy
from scrapy.pipelines.images import ImagesPipeline

class NgaImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # use 'accession' as the name for the image when it's downloaded
        return [scrapy.Request(x, meta={'image_name': item["accession"]})
                for x in item.get('image_urls', [])]

    # write in the current folder using the name chosen above
    def file_path(self, request, response=None, info=None):
        return '%s.jpg' % request.meta['image_name']
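One wiring note, not spelled out above: the path returned by file_path() is interpreted relative to the image store, so the pipeline also has to be enabled and given a storage directory. A sketch, with the module path and directory as assumptions:

# settings.py (module path and storage directory are assumptions)
ITEM_PIPELINES = {
    'ngamedallions.pipelines.NgaImagesPipeline': 1,
}
IMAGES_STORE = '/path/to/image/store'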
Regarding exported fields, the suggestion from @eLRuLL worked for me:
# -*- coding: utf-8 -*-
import scrapy
from scrapy import signals
from scrapy.exporters import CsvItemExporter

class NgaCsvPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        ofile = open('%s_items.csv' % spider.name, 'w+b')
        self.files[spider] = ofile
        self.exporter = CsvItemExporter(ofile,
            fields_to_export=['accession', 'title', 'date', 'inscription'])
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        ofile = self.files.pop(spider)
        ofile.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
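As an aside: if you export through Scrapy's built-in feed exports instead of a hand-rolled pipeline, the same column selection is available (in Scrapy 1.1+) through a single setting:

# settings.py -- then run e.g.: scrapy crawl ngamedallions -o items.csv
FEED_EXPORT_FIELDS = ['accession', 'title', 'date', 'inscription']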
Upvotes: 3