Capi Etheriel

Reputation: 3640

How to add data to a dict-like Item Field using ItemLoaders?

I'm using Scrapy's XPathItemLoader, but its API only documents adding values to an Item Field, not anything deeper :( I mean:

def parse_item(self, response):
    loader = XPathItemLoader(response=response)
    loader.add_xpath('name', '//h1')

This will add the values found by the XPath to Item.name, but how do I add them to Item.profile['name']?

Upvotes: 3

Views: 1721

Answers (2)

宏杰李

Reputation: 12168

These are the default settings of scrapy.loader.ItemLoader:

class ItemLoader(object):

    default_item_class = Item
    default_input_processor = Identity()
    default_output_processor = Identity()
    default_selector_class = Selector

When you use add_value, add_xpath or add_css, the input and output processors are Identity(), which does nothing, so you can pass a dict directly with add_value:

name = response.xpath('//h1/text()').extract_first()
loader.add_value('profile', {'name': name})
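
For completeness, here is a minimal sketch of the whole flow. The ProfileItem class and the TakeFirst() output processor are assumptions for the example (they are not defined in the question); TakeFirst() just unwraps the one-element list the loader collects, so the field ends up holding the dict itself:

from scrapy.item import Item, Field
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst


class ProfileItem(Item):
    # TakeFirst() returns the first collected value, so 'profile' holds
    # the dict itself rather than [{'name': ...}]
    profile = Field(output_processor=TakeFirst())


def parse_item(self, response):
    loader = ItemLoader(item=ProfileItem(), response=response)
    # build the nested dict yourself, then hand it to the loader in one call
    name = response.xpath('//h1/text()').extract_first()
    loader.add_value('profile', {'name': name})
    return loader.load_item()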

Upvotes: 2

alecxe

Reputation: 474001

XPathItemLoader.add_xpath doesn't support writing to nested fields. You should construct your profile dict manually and write it via the add_value method (if you still want to use loaders). Or, you can write your own custom loader; a sketch of that approach is below.
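
A sketch of the custom-loader route could look like the following. The ProfileLoader name and the add_profile_xpath() helper are made up for illustration; only get_xpath(), add_value() and load_item() are standard loader methods:

from scrapy.contrib.loader import XPathItemLoader


class ProfileLoader(XPathItemLoader):
    def __init__(self, *args, **kwargs):
        super(ProfileLoader, self).__init__(*args, **kwargs)
        self._profile = {}

    def add_profile_xpath(self, key, xpath):
        # get_xpath() evaluates the expression and returns the extracted values
        values = self.get_xpath(xpath)
        if values:
            self._profile[key] = values[0]

    def load_item(self):
        # merge everything gathered so far into the single 'profile' field
        if self._profile:
            self.add_value('profile', self._profile)
        return super(ProfileLoader, self).load_item()

Then in parse() you would call loader.add_profile_xpath('name', '//h1/text()') and the dict lands in the profile field when load_item() runs.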

Here's an example using add_value:

from scrapy.contrib.loader import XPathItemLoader
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class TestItem(Item):
    others = Field()


class WikiSpider(BaseSpider):
    name = "wiki"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Main_Page"]


    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        loader = XPathItemLoader(item=TestItem(), response=response)

        others = {}
        crawled_items = hxs.select('//div[@id="mp-other"]/ul/li/b/a')
        for item in crawled_items:
            href = item.select('@href').extract()[0]
            name = item.select('text()').extract()[0]
            others[name] = href

        loader.add_value('others', others)
        return loader.load_item()

Run it via: scrapy runspider <script_name> --output test.json.

The spider collects the "Other areas of Wikipedia" links from the main Wikipedia page and writes them to the dictionary field others.

Hope that helps.

Upvotes: 3
