Reputation: 43
I'm trying to parse XML and convert it to CSV. The tricky part is that the headers must exactly match the terms specified in the documentation of a third-party CSV parser, and those terms contain spaces between words, e.g. "Item title", "Item description", etc.
Since item fields are defined as variables in items.py, I can't create fields whose names contain spaces, e.g.
Item title = scrapy.Field()
I've tried adding to settings.py:
FEED_EXPORT_FIELDS = ["Item title", "Item description"]
It changes the CSV headers, but they then no longer match the item field names, so no data is written to the .csv file.
from scrapy.spiders import XMLFeedSpider

# FeedItem is the item class defined in items.py


class MySpider(XMLFeedSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/feed.xml']
    itertag = 'item'

    def parse_node(self, response, node):
        item = FeedItem()
        item['id'] = node.xpath('//*[name()="g:id"]/text()').get()
        item['title'] = node.xpath('//*[name()="g:title"]/text()').get()
        item['description'] = node.xpath('//*[name()="g:description"]/text()').get()
        return item
The parser works fine and I get all the data I need. The issue is only with the CSV headers.
Is there an easy way to add custom headers that don't match the item field names and can contain several words?
Output I currently get:
id, title, description
12345, Lorem Ipsum, Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
12346, Quick Fox, The quick brown fox jumps over the lazy dog.
Desired output should look like this:
ID, Item title, Item description
12345, Lorem Ipsum, Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
12346, Quick Fox, The quick brown fox jumps over the lazy dog.
Input for testing:
<rss>
  <channel>
    <title>Example</title>
    <link>http://www.example.com</link>
    <description>Description of Example.com</description>
    <item>
      <g:id>12345</g:id>
      <g:title>Lorem Ipsum</g:title>
      <g:description>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</g:description>
    </item>
    <item>
      <g:id>12346</g:id>
      <g:title>Quick Fox</g:title>
      <g:description>The quick brown fox jumps over the lazy dog.</g:description>
    </item>
  </channel>
</rss>
And this is the content of items.py:
import scrapy


class FeedItem(scrapy.Item):
    id = scrapy.Field()
    title = scrapy.Field()
    description = scrapy.Field()
Upvotes: 4
Views: 1831
Reputation: 21436
You can write your own CSV exporter. The simplest approach is to subclass the built-in CsvItemExporter and override the method that writes the header row:
# exporters.py
from scrapy.exporters import CsvItemExporter


class MyCsvItemExporter(CsvItemExporter):
    # maps item field names to the header text written to the CSV
    header_map = {
        'id': 'ID',
        'title': 'Item title',
        'description': 'Item description',
    }

    # note: this overrides a private method, so check it against your Scrapy version
    def _write_headers_and_set_fields_to_export(self, item):
        if not self.include_headers_line:
            return
        # this is the parent logic taken from parent class
        if not self.fields_to_export:
            if isinstance(item, dict):
                # for dicts try using fields of the first item
                self.fields_to_export = list(item.keys())
            else:
                # use fields declared in Item
                self.fields_to_export = list(item.fields.keys())
        headers = list(self._build_row(self.fields_to_export))
        # here we add our own extra mapping
        # map headers to our value
        headers = [self.header_map.get(header, header) for header in headers]
        self.csv_writer.writerow(headers)
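The key point of the override is that only the header row is renamed, while the data rows are still looked up by the original field names. A minimal, standard-library-only sketch of that idea follows. It does not use Scrapy at all; `HEADER_MAP`, `export_rows` and the sample items are hypothetical names chosen for illustration.

```python
import csv
import io

# Hypothetical mapping from item field names to the headers the CSV consumer expects.
HEADER_MAP = {"id": "ID", "title": "Item title", "description": "Item description"}


def export_rows(items, fields, header_map):
    """Write items to CSV text, renaming only the header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Header row: use the mapped name if there is one, otherwise the field name itself.
    writer.writerow([header_map.get(field, field) for field in fields])
    # Data rows: values are still looked up by the original field names.
    for item in items:
        writer.writerow([item.get(field, "") for field in fields])
    return buffer.getvalue()


items = [
    {"id": "12345", "title": "Lorem Ipsum", "description": "Lorem ipsum dolor sit amet"},
    {"id": "12346", "title": "Quick Fox", "description": "The quick brown fox jumps over the lazy dog."},
]
csv_text = export_rows(items, ["id", "title", "description"], HEADER_MAP)
print(csv_text)
```

Because the lookup keys never change, the rows stay populated, which is what goes wrong when the renamed headers are put directly into FEED_EXPORT_FIELDS as a list.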
Then activate the custom exporter in settings.py:
FEED_EXPORTERS = {
    'csv': 'myproject.exporters.MyCsvItemExporter',
}
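As a possible alternative that needs no custom exporter, newer Scrapy releases document a dict form of FEED_EXPORT_FIELDS, where keys are item field names and values are the output names. The fragment below is a sketch under the assumption that your installed version supports this form; on older versions the setting only accepts a list, so check the documentation for your release first.

```python
# settings.py (sketch; assumes a Scrapy release whose FEED_EXPORT_FIELDS accepts a dict)
# Keys are the item field names, values are the header names written to the feed.
FEED_EXPORT_FIELDS = {
    "id": "ID",
    "title": "Item title",
    "description": "Item description",
}
```

With this form the item definition in items.py stays unchanged, and the dict order also determines the column order.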
Upvotes: 1
Reputation: 3561
You can use the built-in dict type as the item, with the required CSV header values as the dictionary keys:
def parse_node(self, response, node):
    item = dict()  # or: item = {}
    item['ID'] = node.xpath('//*[name()="g:id"]/text()').get()
    item['Item title'] = node.xpath('//*[name()="g:title"]/text()').get()
    item['Item description'] = node.xpath('//*[name()="g:description"]/text()').get()
    return item  # or: yield item
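To see why this works, note that dict keys, unlike Python attribute names, may contain spaces, and a CSV writer can take its header row straight from those keys. The sketch below shows this with the standard library's `csv.DictWriter` rather than Scrapy; the sample items and the variable names are made up for illustration.

```python
import csv
import io

# Items as plain dicts whose keys are already the desired CSV headers.
items = [
    {"ID": "12345", "Item title": "Lorem Ipsum", "Item description": "Lorem ipsum dolor sit amet"},
    {"ID": "12346", "Item title": "Quick Fox", "Item description": "The quick brown fox jumps over the lazy dog."},
]

buffer = io.StringIO()
# The header row is taken from the keys of the first item, in insertion order.
writer = csv.DictWriter(buffer, fieldnames=list(items[0].keys()), lineterminator="\n")
writer.writeheader()
writer.writerows(items)
dict_csv_text = buffer.getvalue()
print(dict_csv_text)
```

The trade-off is that plain dicts give up the declared fields of a scrapy.Item, so a misspelled key silently becomes a new column instead of raising an error.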
Upvotes: 0