jox58

Reputation: 65

How to Serialize Scrapy Fields that are Lists of Items in XML Exporter

I built complex Items where fields may be lists of other Item types. When I export them with the default XmlItemExporter, the Items in those sub-lists are wrapped in <value> tags. I'm looking for an example of how to use the sub-Item names for those tags instead.

The Item Exporters page of the docs explains this, saying:

Unless overridden in the serialize_field() method, multi-valued fields are exported by serializing each value inside a <value> element. This is for convenience, as multi-valued fields are very common.

The docs page also gives simple examples for Declaring A Serializer In The Field and Overriding The Serialize_Field() Method, but both are for single-valued fields, with no suggestion of how they can be customized for multi-valued fields.

I searched the web for an example of how that is done and haven't found any.

Here is a sample Item tree I used for testing:

class Course(scrapy.Item):
    title = scrapy.Field()
    lessons = scrapy.Field()

class Lesson(scrapy.Item):
    session = scrapy.Field()
    topic = scrapy.Field()
    assignment = scrapy.Field()

class ReadingAssignment(scrapy.Item):
    textBook = scrapy.Field()
    pages = scrapy.Field()

course = Course()
course['title'] = 'Greatness'
course['lessons'] = []

lesson = Lesson()
lesson['session'] = 'Week 1'
lesson['topic'] = 'Think Great'
lesson['assignment'] = []

reading =  ReadingAssignment()
reading['textBook'] = 'Great Book 1'
reading['pages'] = '1-20'
lesson['assignment'].append(reading)
course['lessons'].append(lesson)

lesson = Lesson()
lesson['session'] = 'Week 2'
lesson['topic'] = 'Act Great'
lesson['assignment'] = []

reading =  ReadingAssignment()
reading['textBook'] = 'Great Book 2'
reading['pages'] = '21-40'
lesson['assignment'].append(reading)
course['lessons'].append(lesson)

lesson = Lesson()
lesson['session'] = 'Week 3'
lesson['topic'] = 'Look Great'
lesson['assignment'] = []

reading =  ReadingAssignment()
reading['textBook'] = 'Great Book 3'
reading['pages'] = '41-60'
lesson['assignment'].append(reading)
course['lessons'].append(lesson)

lesson = Lesson()
lesson['session'] = 'Week 4'
lesson['topic'] = 'Be Great'
lesson['assignment'] = []

reading =  ReadingAssignment()
reading['textBook'] = 'Great Book 4'
reading['pages'] = '61-80'
lesson['assignment'].append(reading)
course['lessons'].append(lesson)

outputs:

>>> course
{'lessons': [{'assignment': [{'pages': '1-20', 'textBook': 'Great Book 1'}],
              'session': 'Week 1',
              'topic': 'Think Great'},
             {'assignment': [{'pages': '21-40', 'textBook': 'Great Book 2'}],
              'session': 'Week 2',
              'topic': 'Act Great'},
             {'assignment': [{'pages': '41-60', 'textBook': 'Great Book 3'}],
              'session': 'Week 3',
              'topic': 'Look Great'},
             {'assignment': [{'pages': '61-80', 'textBook': 'Great Book 4'}],
              'session': 'Week 4',
              'topic': 'Be Great'}],
 'title': 'Greatness'}

When I run this through the XmlItemExporter I get:

<?xml version="1.0" encoding="utf-8"?>
<items>
  <course>
    <title>Greatness</title>
    <lessons>
      <value>
        <session>Week 1</session>
        <topic>Think Great</topic>
        <assignment>
          <value>
            <textBook>Great Book 1</textBook>
            <pages>1-20</pages>
          </value>
        </assignment>
      </value>
      <value>
        <session>Week 2</session>
        <topic>Act Great</topic>
        <assignment>
          <value>
            <textBook>Great Book 2</textBook>
            <pages>21-40</pages>
          </value>
        </assignment>
      </value>
      <value>
        <session>Week 3</session>
        <topic>Look Great</topic>
        <assignment>
          <value>
            <textBook>Great Book 3</textBook>
            <pages>41-60</pages>
          </value>
        </assignment>
      </value>
      <value>
        <session>Week 4</session>
        <topic>Be Great</topic>
        <assignment>
          <value>
            <textBook>Great Book 4</textBook>
            <pages>61-80</pages>
          </value>
        </assignment>
      </value>
    </lessons>
  </course>
</items>

What I'd like to do is change those <value> tags to the names of the Items appended into the lists. Like this:

<items>
  <course>
    <title>Greatness</title>
    <lessons>
      <lesson>
        <session>Week 1</session>
        <topic>Think Great</topic>
        <assignment>
          <reading>
            <textBook>Great Book 1</textBook>
            <pages>1-20</pages>
          </reading>
        </assignment>
      </lesson>
      <lesson>
        <session>Week 2</session>
        <topic>Act Great</topic>
        <assignment>
          <reading>
            <textBook>Great Book 2</textBook>
            <pages>21-40</pages>
          </reading>
        </assignment>
      </lesson>
      <lesson>
        <session>Week 3</session>
        <topic>Look Great</topic>
        <assignment>
          <reading>
            <textBook>Great Book 3</textBook>
            <pages>41-60</pages>
          </reading>
        </assignment>
      </lesson>
      <lesson>
        <session>Week 4</session>
        <topic>Be Great</topic>
        <assignment>
          <reading>
            <textBook>Great Book 4</textBook>
            <pages>61-80</pages>
          </reading>
        </assignment>
      </lesson>
    </lessons>
  </course>
</items>

Upvotes: 1

Views: 1360

Answers (1)

Martijn Pieters

Reputation: 1123400

This is indeed not well documented, and we'll have to resort to reading the XmlItemExporter source code, where it turns out that the <value> tag choice has been hard-coded in the XmlItemExporter._export_xml_field() method:

elif is_listlike(serialized_value):
    self._beautify_newline()
    for value in serialized_value:
        self._export_xml_field('value', value, depth=depth+1)
    self._beautify_indent(depth=depth)

Luckily, there is a way out, on the lines before:

if hasattr(serialized_value, 'items'):
    self._beautify_newline()
    for subname, value in serialized_value.items():
        self._export_xml_field(subname, value, depth=depth+1)
    self._beautify_indent(depth=depth)

That's meant to handle a dictionary, but it in fact will take anything that has a .items() method that returns tuples of strings and items!
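To make the difference between the two branches concrete, here is a minimal sketch that does not need Scrapy. The `export_field()` function is a simplified stand-in for the exporter's branching, not Scrapy's actual code, and `RepeatedTag` is a hypothetical helper name used only for illustration:

```python
from io import StringIO
from xml.sax.saxutils import XMLGenerator


class RepeatedTag:
    """Hypothetical helper: .items() yields the same tag name for every value."""

    def __init__(self, name, values):
        self._name = name
        self._values = values

    def items(self):
        for value in self._values:
            yield self._name, value


def export_field(xg, name, value):
    """Simplified stand-in for the exporter's branching, not Scrapy's code."""
    xg.startElement(name, {})
    if hasattr(value, "items"):
        # Dict-like branch: each (subname, subvalue) pair becomes a child.
        for subname, subvalue in value.items():
            export_field(xg, subname, subvalue)
    elif isinstance(value, (list, tuple)):
        # List-like branch: the child tag name is fixed to "value".
        for subvalue in value:
            export_field(xg, "value", subvalue)
    else:
        xg.characters(str(value))
    xg.endElement(name)


def render(name, value):
    out = StringIO()
    export_field(XMLGenerator(out), name, value)
    return out.getvalue()


plain = render("lessons", ["Week 1", "Week 2"])
custom = render("lessons", RepeatedTag("lesson", ["Week 1", "Week 2"]))
print(plain)   # children are wrapped in <value>
print(custom)  # children are wrapped in <lesson>
```

Because the dict-like check comes first, any object with a suitable `.items()` method gets to pick its own child tag names, including the same name repeated many times, which a real dictionary cannot do.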

However, one important step is missing in the exporter: recursion. You can basically only set serializer flags on the top-level item fields; any Field() declaration on Item subclasses nested below the top-level item is entirely ignored by the current Scrapy implementation. Each exporter also has its own peculiarities in how it drives the internal BaseItemExporter._get_serialized_fields() method, so we can't handle recursion up front, as each specific exporter (JSON, XML, etc.) differs in how it needs fields serialized. We can work around this with a subclass of the XmlItemExporter class, more below.

So the first trick here is to create a dedicated object that has a .items() method and gives you your <container> tags. Note that we have to handle recursion of serialisation ourselves! The Scrapy serializers don't themselves handle recursion into nested structures:

class CustomXMLValuesSerializer:
    @classmethod
    def serialize_as(cls, name):
        def serializer(items, serialize):
            return cls(name, items, serialize)
        return serializer

    def __init__(self, name, items, serialize=None):
        self._name = name
        self._items = items
        self._serialize = serialize if serialize is not None else lambda x: x

    def items(self):
        for item in self._items:
            yield (self._name, self._serialize(item))

Then use the CustomXMLValuesSerializer.serialize_as() class method to create custom serializers for your list fields:

class Course(scrapy.Item):
    title = scrapy.Field()
    lessons = scrapy.Field(
        serializer=CustomXMLValuesSerializer.serialize_as("lesson")
    )

class Lesson(scrapy.Item):
    session = scrapy.Field()
    topic = scrapy.Field()
    assignment = scrapy.Field(
        serializer=CustomXMLValuesSerializer.serialize_as("reading")
    )

class ReadingAssignment(scrapy.Item):
    textBook = scrapy.Field()
    pages = scrapy.Field()
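To see what such a serializer hands to the exporter, the following minimal sketch restates the class so that it runs on its own (no Scrapy needed) and calls it with two arguments, the way the custom exporter does. The `upper_topic()` function is a hypothetical stand-in for the exporter's recursion callback, and plain dicts stand in for Items:

```python
class CustomXMLValuesSerializer:
    """Wraps a list so that .items() yields (tag_name, serialized_item) pairs."""

    @classmethod
    def serialize_as(cls, name):
        def serializer(items, serialize):
            return cls(name, items, serialize)
        return serializer

    def __init__(self, name, items, serialize=None):
        self._name = name
        self._items = items
        self._serialize = serialize if serialize is not None else lambda x: x

    def items(self):
        for item in self._items:
            yield (self._name, self._serialize(item))


def upper_topic(item):
    # Hypothetical stand-in for the exporter's recursion callback.
    return {**item, "topic": item["topic"].upper()}


lessons = [{"topic": "Think Great"}, {"topic": "Act Great"}]
wrapped = CustomXMLValuesSerializer.serialize_as("lesson")(lessons, upper_topic)
pairs = list(wrapped.items())
print(pairs)
```

Every pair carries the same name, "lesson", and each item has already been passed through the callback, so the exporter's dict-like branch emits one <lesson> element per list entry.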

Finally, we need a slightly customised exporter, one that actually lets us handle nested items recursively:

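import scrapy
from scrapy.exporters import XmlItemExporter

class RecursingXmlItemExporter(XmlItemExporter):
    def _recursive_serialized_fields(self, item):
        # Nested Items are serialized field by field, so that their own
        # Field(serializer=...) declarations are honoured.
        if isinstance(item, scrapy.Item):
            return dict(self._get_serialized_fields(item, default_value=''))
        return item

    def serialize_field(self, field, name, value):
        serializer = field.get('serializer', lambda x: x)
        try:
            # Custom serializers take a second argument, used to recurse.
            return serializer(value, self._recursive_serialized_fields)
        except TypeError:
            # Ordinary serializers only accept the value itself.
            return serializer(value)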
Note that this passes in default_value='', because that's what the base XmlItemExporter.export_item() implementation uses.
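One caveat of the try / except TypeError fallback in serialize_field() is that a TypeError raised inside a two-argument serializer is caught as well, after which the serializer is called a second time with a single argument. A possible alternative, sketched here with the standard library's inspect.signature(), is to check up front whether the serializer accepts a second positional argument. The helper names `accepts_two_args()` and `call_serializer()` are hypothetical and not part of Scrapy:

```python
import inspect


def accepts_two_args(func):
    """Return True if func can be called with two positional arguments."""
    try:
        signature = inspect.signature(func)
    except (TypeError, ValueError):
        # Some builtins expose no signature; treat them as single-argument.
        return False
    try:
        signature.bind(None, None)
    except TypeError:
        return False
    return True


def call_serializer(serializer, value, recurse):
    """Pass the recursion callback only to serializers that can take it."""
    if accepts_two_args(serializer):
        return serializer(value, recurse)
    return serializer(value)


def one_arg(value):
    return str(value)


def two_args(value, recurse):
    return [recurse(v) for v in value]


print(call_serializer(one_arg, 42, lambda v: v))
print(call_serializer(two_args, [1, 2], lambda v: v * 10))
```

The trade-off is that a serializer declared with *args would also count as accepting the callback, so this is a sketch of the idea rather than a drop-in replacement.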

Make sure to use this custom exporter, as it passes in the required context to serialize nested items:

exporter = RecursingXmlItemExporter(some_file, indent=2, item_element='course')
exporter.start_exporting()
exporter.export_item(course)
exporter.finish_exporting()

Now the containers are actually exported using the name string as the container element:

<?xml version="1.0" encoding="utf-8"?>
<items>
  <course>
    <title>Greatness</title>
    <lessons>
      <lesson>
        <session>Week 1</session>
        <topic>Think Great</topic>
        <assignment>
          <reading>
            <textBook>Great Book 1</textBook>
            <pages>1-20</pages>
          </reading>
        </assignment>
      </lesson>
      <lesson>
        <session>Week 2</session>
        <topic>Act Great</topic>
        <assignment>
          <reading>
            <textBook>Great Book 2</textBook>
            <pages>21-40</pages>
          </reading>
        </assignment>
      </lesson>
      <lesson>
        <session>Week 3</session>
        <topic>Look Great</topic>
        <assignment>
          <reading>
            <textBook>Great Book 3</textBook>
            <pages>41-60</pages>
          </reading>
        </assignment>
      </lesson>
      <lesson>
        <session>Week 4</session>
        <topic>Be Great</topic>
        <assignment>
          <reading>
            <textBook>Great Book 4</textBook>
            <pages>61-80</pages>
          </reading>
        </assignment>
      </lesson>
    </lessons>
  </course>
</items>

I filed issue #3888 with Scrapy to see if the project is interested in supporting nested Item structures better.

An alternate approach would be to export nested items with separate calls to the XmlItemExporter.export_item() method, but this then requires that the exporter is accessible as a global in the same namespace as the serializers, or that you subclass the exporter and... pass along the exporter to the serializers. And you then have to contend with the fact that XmlItemExporter.export_item() hard-codes the indentation.

Upvotes: 3
