Reputation: 7048
I am trying to write the output of a scraped XML file to JSON. The scrape fails because an item is not serializable.
According to this question, you need to build a pipeline, but the answer does not show one because it was out of scope for that question: SO scrapy serializer
So I referred to the Scrapy docs, which illustrate an example. However, the docs then advise against using it:
The purpose of JsonWriterPipeline is just to introduce how to write item pipelines. If you really want to store all scraped items into a JSON file you should use the Feed exports.
If I go to the feed exports page, this is shown:
JSON
FEED_FORMAT: json Exporter used: JsonItemExporter See this warning if you’re using JSON with large feeds.
My issue remains, because as I understand it, feed exports are meant to be run from the command line, like this:
scrapy runspider myxml.py -o ~/items.json -t json
However, this raises the very error I was hoping a pipeline would solve:
TypeError: <bound method SelectorList.extract of [<Selector xpath='.//@venue' data=u'Royal Randwick'>]> is not JSON serializable
How do I create a JSON pipeline to fix the JSON serialization error?
This is my code.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
from scrapy.selector import XmlXPathSelector
from conv_xml.items import ConvXmlItem
# https://stackoverflow.com/a/27391649/461887
import json
class MyxmlSpider(scrapy.Spider):
    name = "myxml"
    start_urls = (
        ["file:///home/sayth/Downloads/20160123RAND0.xml"]
    )

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//meeting')
        items = []
        for site in sites:
            item = ConvXmlItem()
            item['venue'] = site.xpath('.//@venue').extract
            item['name'] = site.xpath('.//race/@id').extract()
            item['url'] = site.xpath('.//race/@number').extract()
            item['description'] = site.xpath('.//race/@distance').extract()
            items.append(item)
        return items
# class JsonWriterPipeline(object):
#
#     def __init__(self):
#         self.file = open('items.jl', 'wb')
#
#     def process_item(self, item, spider):
#         line = json.dumps(dict(item)) + "\n"
#         self.file.write(line)
#         return item
Upvotes: 1
Views: 2175
Reputation: 474001
The problem is here:
item['venue'] = site.xpath('.//@venue').extract
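You just forgot to call extract. Without the parentheses, item['venue'] holds the bound method itself rather than the list of strings it returns, and that is what the JSON exporter cannot serialize. No custom pipeline is needed. Replace it with: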
item['venue'] = site.xpath('.//@venue').extract()
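To see why the missing parentheses produce this error, here is a minimal sketch that reproduces it with only the standard library. FakeSelectorList is a hypothetical stand-in for what site.xpath(...) returns, not Scrapy's real class, so the snippet runs without Scrapy installed.

```python
import json


class FakeSelectorList:
    """Hypothetical stand-in for Scrapy's SelectorList, for illustration only."""

    def __init__(self, values):
        self._values = list(values)

    def extract(self):
        # Return plain strings, which json can serialize.
        return list(self._values)


venue = FakeSelectorList(["Royal Randwick"])

# Without parentheses, the dict stores the bound method itself.
broken_item = {"venue": venue.extract}
try:
    json.dumps(broken_item)
    error_message = None
except TypeError as exc:
    # json has no encoding for a method object, so it raises TypeError.
    error_message = str(exc)

# With parentheses, the dict stores the list of strings the method returns.
fixed_item = {"venue": venue.extract()}
fixed_json = json.dumps(fixed_item)

print(error_message)
print(fixed_json)
```

The same applies to any item field: as long as every value is a plain string, number, list or dict, the command-line feed export from the question should work without a custom pipeline.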
Upvotes: 3