showkey
showkey

Reputation: 298

How to sort the scrapy item info in customized order?

The default order in scrapy is alphabet,i have read some post to use OrderedDict to output item in customized order.
I write a spider follow the webpage.
How to get order of fields in Scrapy item

My items.py.

import scrapy
from collections import OrderedDict


class OrderedItem(scrapy.Item):
    def __init__(self, *args, **kwargs):
        self._values = OrderedDict()
        if args or kwargs:  
            for k, v in six.iteritems(dict(*args, **kwargs)):
                self[k] = v

class StockinfoItem(OrderedItem):
    name = scrapy.Field()
    phone = scrapy.Field()
    address = scrapy.Field()

The simple spider file.

import scrapy
from info.items import InfoItem

class InfoSpider(scrapy.Spider):
    name = 'Info'
    allowed_domains = ['quotes.money.163.com']
    start_urls = [ "http://quotes.money.163.com/f10/gszl_600023.html"]
    def parse(self, response):
        item = InfoItem()
        item["name"] = response.xpath('/html/body/div[2]/div[4]/table/tr[2]/td[2]/text()').extract()
        item["phone"] = response.xpath('/html/body/div[2]/div[4]/table/tr[7]/td[4]/text()').extract()
        item["address"] = response.xpath('/html/body/div[2]/div[4]/table/tr[2]/td[4]/text()').extract()
        item.items()
        yield  item

The scrapy info when to run the spider.

2019-04-25 13:45:01 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.money.163.com/f10/gszl_600023.html>
{'address': ['浙江省杭州市天目山路152号浙能大厦'],'name': ['浙能电力'],'phone': ['0571-87210223']}

Why i can't get such desired order as below?

{'name': ['浙能电力'],'phone': ['0571-87210223'],'address': ['浙江省杭州市天目山路152号浙能大厦']}

Thank for Gallaecio's advice, to add the following in settings.py.

FEED_EXPORT_FIELDS=['name','phone','address']

Execute the spider and output to csv file.

scrapy crawl  info -o  info.csv

The field order is in my customized order.

cat info.csv
name,phone,address
浙能电力,0571-87210223,浙江省杭州市天目山路152号浙能大

Look at the scrapy's debug info :

2019-04-26 00:16:38 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.money.163.com/f10/gszl_600023.html>
{'address': ['浙江省杭州市天目山路152号浙能大厦'],
 'name': ['浙能电力'],
 'phone': ['0571-87210223']}

How can i make the debug info in customized order?How to get the following debug output?

2019-04-26 00:16:38 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.money.163.com/f10/gszl_600023.html>
{'name': ['浙能电力'],
 'phone': ['0571-87210223'],
 'address': ['浙江省杭州市天目山路152号浙能大厦'],}

Upvotes: 5

Views: 2501

Answers (4)

vezunchik
vezunchik

Reputation: 3717

Problem is in __repr__ function of Item. Originally its code is:

def __repr__(self):
    return pformat(dict(self))

So even if you convert your item to OrderedDict and expect fields to be saved in the same order, this function applies dict() to it and breaks the order.

So, I propose you to overload it in the way you like, for example:

import json

class OrderedItem(scrapy.Item):
    def __init__(self, *args, **kwargs):
        self._values = OrderedDict()
        if args or kwargs:
            for k, v in six.iteritems(dict(*args, **kwargs)):
                self[k] = v

    def __repr__(self):
        return json.dumps(OrderedDict(self), ensure_ascii = False)  # it should return some string

And now you can get this output:

2019-04-30 18:56:20 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.money.163.com/f10/gszl_600023.html>
{"name": ["\u6d59\u80fd\u7535\u529b"], "phone": ["0571-87210223"], "address": ["\u6d59\u6c5f\u7701\u676d\u5dde\u5e02\u5929\u76ee\u5c71\u8def152\u53f7\u6d59\u80fd\u5927\u53a6"]}

Upvotes: 3

showkey
showkey

Reputation: 298

The whole items.py which can output customized dubug info in cjk apperance is as below.

import scrapy
import json    
from collections import OrderedDict

class OrderedItem(scrapy.Item):
    def __init__(self, *args, **kwargs):
        self._values = OrderedDict()
        if args or kwargs:
            for k, v in six.iteritems(dict(*args, **kwargs)):
                self[k] = v

    def __repr__(self):
        return json.dumps(OrderedDict(self),ensure_ascii = False)  
        #ensure_ascii = False ,it make characters show in cjk appearance.

class StockinfoItem(OrderedItem):
    name = scrapy.Field()
    phone = scrapy.Field()
    address = scrapy.Field()

Upvotes: 0

Rasko Vučinić
Rasko Vučinić

Reputation: 79

In your spider replace item.items() with self.log(item.items()), log msg should be list of tuples in order you assigned them in your spider.

Another way is to combine answer you mentioned in your post with this answer

Upvotes: 0

Raphael
Raphael

Reputation: 1801

you can define a custom string representation of your item

class InfoItem:
    def __repr__(self):
      return 'name: {}, phone: {}, address: {}'.format(self['name'], self.['phone'], self.['address'])

Upvotes: 1

Related Questions