Reputation: 3737
From the Scrapy docs I see:
The input processor processes the extracted data as soon as it’s received .... and the result of the input processor is collected and kept inside the ItemLoader. After collecting all data, the ItemLoader.load_item() method is called to populate and get the populated item object. That’s when the output processor is called with the data previously collected (and processed using the input processor). The result of the output processor is the final value that gets assigned to the item.
I get the idea of the input processor. For example, have some data that you want to clean up? Just run it through the appropriate input processor. What I don't understand is the purpose of the output processor. How is this functionally even different from the input processor? Couldn't you just include whatever data transformation you want in the first input processor?
Upvotes: 0
Views: 559
Reputation: 108
I think output processors are mainly useful when a single field's value is extracted from multiple tags, using more than one selector (the same or different ones).
For example:
HTML snippet:
<span class="product-title">Samsung</span>
<p class="product-name">Color TV</p>
Extracting the name from the two tags above:
loader.add_xpath('name', '//span[@class="product-title"]/text()')
loader.add_xpath('name', '//p[@class="product-name"]/text()')
yield loader.load_item()
With input processors you can manipulate the text extracted by each of the two XPaths, one call at a time, and the processed text is appended to a list inside the loader. If no input processor is defined, the collected list is
['Samsung', 'Color TV'].
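When loader.load_item() is called, the output processor is invoked with that collected list as its argument, so it can combine or reduce the values in any way to produce the final value of the name field. That is the difference from an input processor: an input processor only sees the values from one add_xpath() call, while the output processor sees everything collected for the field.
The default output processor is Identity, which returns its input unchanged, so the final output is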
{'name': ['Samsung', 'Color TV']}.
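To make the two stages concrete, here is a rough, standard-library-only sketch of the flow described above. MiniLoader, strip_each, join_with_space and identity are hypothetical names invented for this illustration; this is a toy model of the collect-then-output behaviour, not Scrapy's actual implementation. In a real project I believe the built-in Join processor could play the role of join_with_space.

```python
class MiniLoader:
    """Toy stand-in for an item loader (hypothetical, not Scrapy code)."""

    def __init__(self, input_processor, output_processor):
        self.input_processor = input_processor
        self.output_processor = output_processor
        self.collected = []  # values gathered for the 'name' field

    def add_value(self, raw_values):
        # Input stage: runs once per call and only sees this call's values.
        self.collected.extend(self.input_processor(raw_values))

    def load_item(self):
        # Output stage: runs once and sees everything collected so far.
        return {"name": self.output_processor(self.collected)}


def strip_each(values):
    # Per-value cleanup, the kind of work an input processor does.
    return [v.strip() for v in values]


def join_with_space(values):
    # Combines all collected values, which only the output stage can do.
    return " ".join(values)


def identity(values):
    # Mimics the default behaviour: return the collected list unchanged.
    return values


loader = MiniLoader(strip_each, join_with_space)
loader.add_value([" Samsung "])   # stands in for the first add_xpath() call
loader.add_value(["Color TV\n"])  # stands in for the second add_xpath() call
item = loader.load_item()
print(item)  # {'name': 'Samsung Color TV'}

default_loader = MiniLoader(strip_each, identity)
default_loader.add_value([" Samsung "])
default_loader.add_value(["Color TV\n"])
default_item = default_loader.load_item()
print(default_item)  # {'name': ['Samsung', 'Color TV']}
```

The join cannot be moved into the input stage, because neither call ever sees the other call's value. That is the part an input processor cannot do on its own.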
Upvotes: 2