Lobs001

Reputation: 365

Default value/dealing with empty values when using the ItemLoader in Scrapy

I'm using the ItemLoader to process my scraped data. To maintain the structure and integrity of the original data, and to allow for easy database insertion, I need to store the empty values that my XPaths sometimes return.

The problem is that there seems to be no simple way of doing this with the ItemLoader, as None values don't even seem to reach the input processor.

For simplicity, consider adding two values of type None to an item as follows:

loader.add_value('name', None)
loader.add_value('name', None)

These two lines do not affect the item at all. This is not the behavior I want. Instead, I would like there to be two new elements in item['name'], i.e. ["", ""].
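A plausible reason the None values vanish is that the loader first coerces every added value into a list, and None coerces to an empty list, so a per-value input processor has nothing to run on. Below is a minimal stdlib-only sketch of that effect. `to_list` and `record` are hypothetical stand-ins written for illustration, not the library's own helpers.

```python
def to_list(value):
    """Hypothetical, simplified stand-in for the loader's value-to-list coercion."""
    if value is None:
        return []              # None turns into "no values at all"
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]             # a single value is wrapped in a list


calls = []


def record(value):
    """Stand-in for a per-value input processor; remembers what it was given."""
    calls.append(value)
    return value


processed = [record(v) for v in to_list(None)]
print(processed, calls)   # [] []  (the processor never saw None)
print(to_list(""))        # ['']   (an empty string does survive the coercion)
```

Under this assumption, passing an empty string instead of None is one way to get a value past this first step.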

I modified the _add_value() and load_item() methods of the ItemLoader class like this:

def _add_value(self, field_name, value):
    value = arg_to_iter(value)
    processed_value = self._process_input_value(field_name, value)
    # Always create the field entry, even if nothing was collected for it
    self._values.setdefault(field_name, [])
    self._values[field_name] += arg_to_iter(processed_value)

def load_item(self):
    adapter = ItemAdapter(self.item)
    for field_name in tuple(self._values):
        value = self.get_output_value(field_name)
        if value:
            adapter[field_name] = value
        else:
            # Falsy output (None, "", [], 0) is replaced by a placeholder
            adapter[field_name] = "NA"
    return adapter.item

This at least prevents empty fields from being dropped, but I have no idea what problems might arise from doing this, and it doesn't really solve my problem, since I want to store every empty value.

One solution is of course to simply not use the ItemLoader and instead check whether the result of response.xpath() is empty. However, that would make my project a lot messier, which I would like to avoid if possible.
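The per-field check does not have to be scattered through the spider. A minimal sketch, assuming a small helper (hypothetical name `or_placeholder`, not part of Scrapy) that turns a missing result into a placeholder before the value is handed to add_value:

```python
def or_placeholder(value, placeholder=""):
    """Return value unchanged unless it is None or an empty list/tuple.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    if value is None:
        return placeholder
    if isinstance(value, (list, tuple)) and len(value) == 0:
        return placeholder
    return value


# Hand-written stand-ins for what a lookup might yield:
# a match, no match, and an empty result list.
raw_names = ["Alice", None, []]
names = [or_placeholder(v) for v in raw_names]
print(names)  # ['Alice', '', '']
```

In a spider this would wrap the extracted value, e.g. loader.add_value('name', or_placeholder(response.xpath(...).get())). Whether the placeholder survives into the final item still depends on the field's processors. TakeFirst, for example, skips empty strings, so a non-empty placeholder such as "NA" may be the safer choice.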

Any ideas?

Upvotes: 2

Views: 861

Answers (1)

zaki98

Reputation: 1106

You can use dataclass or attrs items and set a default value for each field:

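from dataclasses import dataclass, field
from typing import Optional


@dataclass
class QuoteItem:
    # Fields default to None so the item can be created empty
    text: Optional[str] = field(default=None)
    author: Optional[str] = field(default=None)
    tags: Optional[list] = field(default=None)
    # Placeholder that stays in place when nothing is scraped
    emp: str = field(default="NA")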
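Because every field has a default, the item can be created with no arguments, which, as far as I understand, is what lets a loader populate it field by field. A quick stdlib-only check (the class is repeated here so the snippet runs on its own):

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class QuoteItem:
    text: Optional[str] = field(default=None)
    author: Optional[str] = field(default=None)
    tags: Optional[list] = field(default=None)
    emp: str = field(default="NA")


item = QuoteItem()      # no arguments required
print(asdict(item))
# {'text': None, 'author': None, 'tags': None, 'emp': 'NA'}
```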

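You also have to change emp's output processor to take the second value because, as far as I can tell, the loader starts out with the item's existing values, so the default "NA" is already the first value collected for that field: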

from itemloaders.processors import TakeFirst
from scrapy.loader import ItemLoader


def take_second(value):
    # Return the second collected value, or None if there is only one
    if len(value) > 1:
        return value[1]
    return None


class QuoteLoader(ItemLoader):
    text_out = TakeFirst()
    author_out = TakeFirst()
    emp_out = take_second
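To see what take_second does with the collected values, here is a minimal sketch that assumes the loader already holds the default "NA" for emp before any scraped value is added. The lists are hand-written stand-ins for the loader's collected values, not output from a real crawl.

```python
def take_second(value):
    # Return the second collected value, or None if there is only one
    if len(value) > 1:
        return value[1]
    return None


only_default = ["NA"]                    # nothing was scraped for emp
with_scraped = ["NA", "scraped text"]    # one scraped value was added

print(take_second(only_default))   # None
print(take_second(with_scraped))   # scraped text
```

When the processor returns None, the loader should leave the field untouched, so the item keeps its "NA" default. I have not checked this against every Scrapy version.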

Upvotes: 1
