Reputation: 365
I'm using the itemloader to process my scraped data, and, in order to maintain the structure and integrity of the original data, and to allow for easy database insertion, I need to store the empty values that my XPaths sometimes come up with.
The problem is, however, that there seems to be no simple way of doing this using the itemloader, as None-types don't even seem to reach the input processor.
For simplicity, consider adding two values of type None to an item as follows:
loader.add_value('name', None)
loader.add_value('name', None)
The item will not be affected at all by these two lines. This is not the behavior I want. Instead, I would like there to be two (new) elements in item['name'], like ["", ""].
I modified the _add_value() and load_item() methods of the ItemLoader class like this:
def _add_value(self, field_name, value):
    value = arg_to_iter(value)
    processed_value = self._process_input_value(field_name, value)
    self._values.setdefault(field_name, [])
    self._values[field_name] += arg_to_iter(processed_value)

def load_item(self):
    adapter = ItemAdapter(self.item)
    for field_name in tuple(self._values):
        value = self.get_output_value(field_name)
        if value:
            adapter[field_name] = value
        else:
            adapter[field_name] = "NA"
    return adapter.item
This at least prevents the empty fields, but I have no idea what problems might arise from doing this, and it doesn't really solve my problem, since I want to store all empty data.
One solution is of course to simply not use the ItemLoader, and instead just check whether the value of response.xpath() is null. However, that would make my project a lot messier, which I would like to avoid if possible.
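For reference, that manual check could be confined to one small helper instead of being repeated per field. This is only a minimal sketch: blanks_for_missing is a hypothetical name, the sample values are made up, and whether an empty string then survives the loader depends on the processors in use and on the Scrapy version, which I have not verified.

```python
from typing import Any, Iterable, List


def blanks_for_missing(values: Iterable[Any], placeholder: str = "") -> List[Any]:
    """Replace each None with a placeholder, keeping order and length.

    Hypothetical helper, not part of Scrapy or itemloaders.
    """
    return [placeholder if v is None else v for v in values]


# Made-up sample: one real value followed by two missing ones.
raw = ["Alice", None, None]
cleaned = blanks_for_missing(raw)
print(cleaned)  # ['Alice', '', '']
```

Each cleaned value could then be passed to loader.add_value('name', value) one at a time. Note that an output processor such as TakeFirst skips empty strings, so a placeholder like "NA" may be the safer choice there.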
Any ideas?
Upvotes: 2
Views: 861
Reputation: 1106
You can use dataclass or attr.s items, and set a default value:
from dataclasses import dataclass, field

@dataclass
class QuoteItem:
    text: str = field(default=None)
    author: str = field(default=None)
    tags: list = field(default=None)
    emp: str = field(default="NA")
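To see what the defaults buy you, here is a standalone sketch using only the standard library (no Scrapy involved). The Optional annotations and the sample text are my own additions for illustration.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class QuoteItem:
    # Optional[...] is an illustrative tweak; the fields default to None.
    text: Optional[str] = field(default=None)
    author: Optional[str] = field(default=None)
    tags: Optional[list] = field(default=None)
    # Fields that must never be missing get a placeholder default.
    emp: str = field(default="NA")


# Only 'text' is supplied; every other field falls back to its default.
item = QuoteItem(text="hello")
print(asdict(item))
# {'text': 'hello', 'author': None, 'tags': None, 'emp': 'NA'}
```

Every field is present on the item even when nothing was scraped for it, which is what keeps the structure stable for database insertion.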
You also have to change emp's output_processor to take the second value:
from itemloaders.processors import TakeFirst
from scrapy.loader import ItemLoader

def take_second(value):
    if len(value) > 1:
        return value[1]

class QuoteLoader(ItemLoader):
    text_out = TakeFirst()
    author_out = TakeFirst()
    emp_out = take_second
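Why the second value? As far as I understand it (an assumption worth checking against your itemloaders version), the loader seeds its collected values from the item it is given, so emp starts out as ["NA"] and anything scraped is appended after it. The processor can be tried on plain lists without Scrapy; the sample values below are made up.

```python
def take_second(values):
    """Return the second collected value, or None if there is none."""
    if len(values) > 1:
        return values[1]
    return None


# Only the seeded default is present: nothing was scraped.
only_default = take_second(["NA"])
# The default followed by one scraped value.
with_scraped = take_second(["NA", "Bob"])

print(only_default)   # None
print(with_scraped)   # Bob
```

When the processor returns None, the expectation is that the field is left untouched on the item and so keeps its "NA" default. Again, that is my reading of the loader's behavior rather than something I have tested on every version.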
Upvotes: 1