linksmai

Reputation: 43

Scrapy's Custom CSV headers for CsvItemExporter

I'm trying to parse XML and convert it to CSV. The tricky part is that the headers must exactly match the terms specified in the documentation of a third-party CSV parser, and those terms contain spaces between words, e.g. "Item title", "Item description", etc.

Since item fields are defined as class attributes in items.py, I can't create field names containing spaces, e.g.

Item title = scrapy.Field()

I've tried adding to settings.py:

FEED_EXPORT_FIELDS = ["Item title", "Item description"]

This changes the CSV headers, but they then no longer match the item fields, so no data is written to the .csv file.

    class MySpider(XMLFeedSpider):
        name = 'example'
        allowed_domains = ['example.com']
        start_urls = ['http://example.com/feed.xml']
        itertag = 'item'

        def parse_node(self, response, node):
            item = FeedItem()
            item['id'] = node.xpath('//*[name()="g:id"]/text()').get()
            item['title'] = node.xpath('//*[name()="g:title"]/text()').get()
            item['description'] = node.xpath('//*[name()="g:description"]/text()').get()

            return item

The parser works fine and I get all the data I need. The only issue is the CSV headers.

Is there an easy way to add custom headers that don't match the item field names and can contain several words?

Output I currently get:

id, title, description
12345, Lorem Ipsum, Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
12346, Quick Fox, The quick brown fox jumps over the lazy dog.

Desired output should look like this:

ID, Item title, Item description
12345, Lorem Ipsum, Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
12346, Quick Fox, The quick brown fox jumps over the lazy dog.

Input for testing:

<rss>
<channel>
  <title>Example</title>
  <link>http://www.example.com</link>
  <description>Description of Example.com</description>
        <item>
            <g:id>12345</g:id>
            <g:title>Lorem Ipsum</g:title>
            <g:description>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</g:description>
        </item>
        <item>
            <g:id>12346</g:id>
            <g:title>Quick Fox</g:title>
            <g:description>The quick brown fox jumps over the lazy dog.</g:description>
        </item>
</channel>
</rss>

And this is the content of items.py:

import scrapy

class FeedItem(scrapy.Item):
    id = scrapy.Field()
    title = scrapy.Field()
    description = scrapy.Field()
    pass

Upvotes: 4

Views: 1831

Answers (2)

Granitosaurus

Reputation: 21436

You can write your own CSV exporter. The simplest approach is to subclass the built-in CsvItemExporter and override the method that writes the header row:

# exporters.py
from scrapy.exporters import CsvItemExporter


class MyCsvItemExporter(CsvItemExporter):
    # maps item field names to the header text written to the CSV file
    header_map = {
        'id': 'ID',
        'title': 'Item title',
        'description': 'Item description',
    }

    def _write_headers_and_set_fields_to_export(self, item):
        if not self.include_headers_line:
            return
        # this logic is copied from the parent class
        if not self.fields_to_export:
            if isinstance(item, dict):
                # for dicts, use the keys of the first item
                self.fields_to_export = list(item.keys())
            else:
                # use the fields declared in the Item class
                self.fields_to_export = list(item.fields.keys())
        headers = list(self._build_row(self.fields_to_export))

        # our addition: replace each header with its mapped name,
        # leaving unmapped fields unchanged
        headers = [self.header_map.get(header, header) for header in headers]
        self.csv_writer.writerow(headers)

And then activate it in your settings:

FEED_EXPORTERS = {
    'csv': 'myproject.exporters.MyCsvItemExporter',
}
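
Independent of Scrapy, the idea behind this exporter is to keep using the original field names to look up values and to translate them only when the header row is written. A minimal standard-library sketch of that idea follows; `HEADER_MAP` and `export_rows` are illustrative names invented for this example, not part of Scrapy's API:

```python
import csv
import io

# Hypothetical mapping from internal field names to the headers
# the third-party CSV parser expects.
HEADER_MAP = {
    "id": "ID",
    "title": "Item title",
    "description": "Item description",
}


def export_rows(items, fields, header_map):
    """Return CSV text for items, renaming only the header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Header row: translated names, falling back to the field name itself.
    writer.writerow([header_map.get(field, field) for field in fields])
    # Data rows: values are still looked up by the original field names.
    for item in items:
        writer.writerow([item.get(field, "") for field in fields])
    return buffer.getvalue()


items = [
    {"id": "12345", "title": "Lorem Ipsum", "description": "Lorem ipsum dolor sit amet"},
    {"id": "12346", "title": "Quick Fox", "description": "The quick brown fox jumps over the lazy dog."},
]
csv_text = export_rows(items, ["id", "title", "description"], HEADER_MAP)
print(csv_text)
```

Because the lookup keys never change, the spider and items.py can stay as they are; only the header text differs.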

Upvotes: 1

Georgiy

Reputation: 3561

You can use a plain built-in dict as the item and use the required CSV header values as its keys:

    def parse_node(self, response, node):
        item = {}  # a plain dict instead of FeedItem
        item['ID'] = node.xpath('//*[name()="g:id"]/text()').get()
        item['Item title'] = node.xpath('//*[name()="g:title"]/text()').get()
        item['Item description'] = node.xpath('//*[name()="g:description"]/text()').get()

        return item  # "yield item" works as well
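
This works because dict keys, unlike Python attribute names, may contain spaces, and the keys themselves become the header row. The effect can be sketched outside Scrapy with the standard library's `csv.DictWriter`; this is an illustration under that assumption, not Scrapy's actual export code:

```python
import csv
import io

# Items as plain dicts whose keys are the exact headers required.
rows = [
    {"ID": "12345", "Item title": "Lorem Ipsum", "Item description": "Lorem ipsum dolor sit amet"},
    {"ID": "12346", "Item title": "Quick Fox", "Item description": "The quick brown fox jumps over the lazy dog."},
]

buffer = io.StringIO()
# The column order follows the key order of the first dict.
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
output = buffer.getvalue()
print(output)
```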

Upvotes: 0
