GriMel

Reputation: 2320

Scrapy convert from unicode to utf-8

I've written a simple script to extract data from a site. The script works as expected, but I'm not pleased with the output format.
Here is my code:

import urlparse

from scrapy import Spider, Request
from scrapy.loader import ItemLoader

from myproject.items import Article  # the item class defined in items.py


class ArticleSpider(Spider):
    name = "article"
    allowed_domains = ["example.com"]
    start_urls = (
        "http://example.com/tag/1/page/1",  # trailing comma makes this a tuple
    )

    def parse(self, response):
        next_selector = response.xpath('//a[@class="next"]/@href')
        url = next_selector[1].extract()
        # url is like "tag/1/page/2"
        yield Request(urlparse.urljoin("http://example.com", url))

        item_selector = response.xpath('//h3/a/@href')
        for url in item_selector.extract():
            yield Request(urlparse.urljoin("http://example.com", url),
                      callback=self.parse_article)

    def parse_article(self, response):
        item = ItemLoader(item=Article(), response=response)
        # here i extract title of every article
        item.add_xpath('title', '//h1[@class="title"]/text()')
        return item.load_item()

The output looks like this:

[scrapy] DEBUG: Scraped from <200 http://example.com/tag/1/article_name> {'title': [u'\xa0"\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e \u0421\u0412\u041e\u0411\u041e\u0414\u0410 \u0417\u0410\u041a\u0410\u041d\u0427\u0418\u0412\u0410\u0415\u0422\u0421\u042f"']}

I think I need to use a custom ItemLoader class, but I don't know how. I need your help.

TL;DR I need to convert the text scraped by Scrapy from unicode to utf-8
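(For reference, the plain-Python conversion itself is just encode(); what I can't figure out is where to hook it into Scrapy. A minimal sketch with a shortened title:)

```python
# A unicode string as Scrapy returns it
title = u'\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e'

utf8_bytes = title.encode('utf-8')      # unicode -> utf-8 encoded bytes
roundtrip = utf8_bytes.decode('utf-8')  # utf-8 bytes -> unicode again

assert roundtrip == title
```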

Upvotes: 5

Views: 4377

Answers (2)

Frederic Bazin

Reputation: 1529

There are two independent issues affecting the display of unicode strings.

  1. If you return a list of strings, the output file will have issues with it, because the ascii codec is used by default to serialize list elements. You can work around this as below, but it's more appropriate to use extract_first() as suggested by @neverlastn:

    class Article(Item):
        title = Field(serializer=lambda x: u', '.join(x))
    
  2. The default implementation of the __repr__() method serializes unicode strings to their escaped \uxxxx form. You can change this behaviour by overriding the method in your item class:

    class Article(Item):
        def __repr__(self):
            data = self.copy()
            for k in data.keys():
                if isinstance(data[k], unicode):
                    data[k] = data[k].encode('utf-8')
            # repr() of a plain dict prints the utf-8 bytes unescaped
            return repr(dict(data))
    

Upvotes: 0

neverlastn

Reputation: 2204

As you can see below, this isn't so much a Scrapy issue as a Python one. It could also only marginally be called an issue :)

$ scrapy shell http://censor.net.ua/resonance/267150/voobscheto_svoboda_zakanchivaetsya

In [7]: print response.xpath('//h1/text()').extract_first()
 "ВООБЩЕ-ТО СВОБОДА ЗАКАНЧИВАЕТСЯ"

In [8]: response.xpath('//h1/text()').extract_first()
Out[8]: u'\xa0"\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e \u0421\u0412\u041e\u0411\u041e\u0414\u0410 \u0417\u0410\u041a\u0410\u041d\u0427\u0418\u0412\u0410\u0415\u0422\u0421\u042f"'

What you see is two different representations of the same thing - a unicode string.
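You can check that the escaped form and the readable form denote the same string:

```python
# -*- coding: utf-8 -*-
escaped = u'\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e'  # what repr() shows
readable = u'ВООБЩЕ-ТО'                                         # what print shows

assert escaped == readable  # one string, two spellings
```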

What I would suggest is running crawls with -L INFO or adding LOG_LEVEL='INFO' to your settings.py, so that this output doesn't show in the console.

One annoying thing is that when you save as JSON, you get escaped unicode JSON e.g.

$ scrapy crawl example -L INFO -o a.jl

gives you:

$ cat a.jl
{"title": "\u00a0\"\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e \u0421\u0412\u041e\u0411\u041e\u0414\u0410 \u0417\u0410\u041a\u0410\u041d\u0427\u0418\u0412\u0410\u0415\u0422\u0421\u042f\""}

This is correct, but it takes more space, and most applications handle non-escaped JSON equally well.
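The difference is just the ensure_ascii flag of json.dumps(); both outputs parse back to identical data (a minimal sketch, independent of Scrapy):

```python
import json

data = {u'title': u'\u0421\u0412\u041e\u0411\u041e\u0414\u0410'}  # u'СВОБОДА'

escaped = json.dumps(data)                      # ensure_ascii=True is the default
raw = json.dumps(data, ensure_ascii=False)      # keeps the characters as-is

assert json.loads(escaped) == json.loads(raw) == data
assert len(escaped) > len(raw)  # each escape costs 6 characters per non-ascii char
```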

Adding a few lines in your settings.py can change this behaviour:

from scrapy.exporters import JsonLinesItemExporter
class MyJsonLinesItemExporter(JsonLinesItemExporter):
    def __init__(self, file, **kwargs):
        super(MyJsonLinesItemExporter, self).__init__(file, ensure_ascii=False, **kwargs)

FEED_EXPORTERS = {
    'jsonlines': 'myproject.settings.MyJsonLinesItemExporter',
    'jl': 'myproject.settings.MyJsonLinesItemExporter',
}

Essentially, we just set ensure_ascii=False for the default JSON Item Exporters. This prevents escaping. I wish there were an easier way to pass arguments to exporters, but I can't see one, since they are initialized with their default arguments around here. Anyway, now your JSON file has:

$ cat a.jl
{"title": " \"ВООБЩЕ-ТО СВОБОДА ЗАКАНЧИВАЕТСЯ\""}

which is better-looking, equally valid, and more compact.

Upvotes: 7
