sayth

Reputation: 7048

Pipeline for item not JSON serializable

I am trying to write the output of a scraped XML file to JSON. The scrape fails because an item is not serializable.

This question (SO scrapy serializer) advises building a pipeline, but no answer is provided there because it was out of scope for that question.

So I referred to the Scrapy docs. They illustrate an example, but then advise against using it:

The purpose of JsonWriterPipeline is just to introduce how to write item pipelines. If you really want to store all scraped items into a JSON file you should use the Feed exports.

If I go to the feed exports page, this is shown:

JSON

FEED_FORMAT: json
Exporter used: JsonItemExporter
See this warning if you’re using JSON with large feeds.

My issue still remains because, as I understand it, feed exports are meant to be run from the command line, like this:

scrapy runspider myxml.py -o ~/items.json -t json

However, this produces the very error I was hoping to solve with a pipeline:

TypeError: <bound method SelectorList.extract of [<Selector xpath='.//@venue' data=u'Royal Randwick'>]> is not JSON serializable

How do I create a JSON pipeline to fix the JSON serialization error?

This is my code.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
from scrapy.selector import XmlXPathSelector
from conv_xml.items import ConvXmlItem
# https://stackoverflow.com/a/27391649/461887
import json


class MyxmlSpider(scrapy.Spider):
    name = "myxml"

    start_urls = (
        ["file:///home/sayth/Downloads/20160123RAND0.xml"]
    )

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//meeting')
        items = []

        for site in sites:
            item = ConvXmlItem()
            item['venue'] = site.xpath('.//@venue').extract
            item['name'] = site.xpath('.//race/@id').extract()
            item['url'] = site.xpath('.//race/@number').extract()
            item['description'] = site.xpath('.//race/@distance').extract()
            items.append(item)

        return items


        # class JsonWriterPipeline(object):
        #
        #     def __init__(self):
        #         self.file = open('items.jl', 'wb')
        #
        #     def process_item(self, item, spider):
        #         line = json.dumps(dict(item)) + "\n"
        #         self.file.write(line)
        #         return item

Upvotes: 1

Views: 2175

Answers (1)

alecxe

Reputation: 474001

The problem is here:

item['venue'] = site.xpath('.//@venue').extract

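You just forgot to call extract. Without the parentheses, the item stores the bound method itself rather than the list it returns, and a method cannot be serialized to JSON. Replace it with: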

item['venue'] = site.xpath('.//@venue').extract()
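To see why the missing parentheses matter, here is a minimal sketch that reproduces the same kind of failure without Scrapy. FakeSelectorList is a hypothetical stand-in written only for this illustration; it is not Scrapy's SelectorList, it just exposes an extract() method that returns a list of strings.

```python
import json


class FakeSelectorList(object):
    """Hypothetical stand-in for a selector list; not part of Scrapy."""

    def __init__(self, values):
        self._values = values

    def extract(self):
        # Return the matched values as a plain list of strings.
        return list(self._values)


venue = FakeSelectorList(["Royal Randwick"])

# Without parentheses we store the method object itself,
# which the json module does not know how to encode.
try:
    json.dumps({"venue": venue.extract})
    failed = False
except TypeError:
    failed = True

# With parentheses we store the returned list, which serializes fine.
encoded = json.dumps({"venue": venue.extract()})
```

Once every field holds plain lists or strings, the feed export command from the question should be able to write the items, so a custom pipeline is probably not needed for this particular error.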

Upvotes: 3
