Litvinenko Evgeny

Reputation: 95

Scrapy yield utf-8

I am trying to adapt the code from the official Scrapy tutorial (http://doc.scrapy.org/en/latest/intro/tutorial.html) to the Russian site habrahabr.ru.

Here is my code:

import scrapy


class DmozSpider(scrapy.Spider):
    name = 'habr'

    allowed_domains = ['habrahabr.ru']

    start_urls = [
        'http://habrahabr.ru/interesting/'
    ]

    def parse(self, response):
        yield {'title': response.xpath('//title/text()').extract()[0]}

It returns:

{'title': u'\u0418\u043d\u0442\u0435\u0440\u0435\u0441\u043d\u044b\u0435 \u043f\u0443\u0431\u043b\u0438\u043a\u0430\u0446\u0438\u0438 / \u0425\u0430\u0431\u0440\u0430\u0445\u0430\u0431\u0440'}

When I try:

 yield {'title': response.xpath('//title/text()').extract()[0].encode('utf-8')}

it returns:

{'title': '\xd0\x98\xd0\xbd\xd1\x82\xd0\xb5\xd1\x80\xd0\xb5\xd1\x81\xd0\xbd\xd1\x8b\xd0\xb5 \xd0\xbf\xd1\x83\xd0\xb1\xd0\xbb\xd0\xb8\xd0\xba\xd0\xb0\xd1\x86\xd0\xb8\xd0\xb8 / \xd0\xa5\xd0\xb0\xd0\xb1\xd1\x80\xd0\xb0\xd1\x85\xd0\xb0\xd0\xb1\xd1\x80'}

How can I change this behavior so that the title is output as readable Cyrillic text?

Upvotes: 4

Views: 1537

Answers (2)

Peyman

Reputation: 4209

Go to the settings.py file of your project and set the FEED_EXPORT_ENCODING option to utf-8.

FEED_EXPORT_ENCODING = "utf-8"

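With this setting, exported feeds (for example JSON output) are written as UTF-8 text instead of \uXXXX escape sequences, which should solve your problem.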
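
To illustrate what the escaped and the readable output look like, here is a minimal sketch that uses only the standard json module, not Scrapy itself. It is an assumption that Scrapy's JSON feed exporter behaves analogously when FEED_EXPORT_ENCODING is set; the item dictionary is a hypothetical stand-in for what the spider yields.

```python
import json

# Hypothetical item, standing in for what the spider yields.
item = {"title": "Интересные публикации / Хабрахабр"}

# Default JSON serialization: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(item)

# ensure_ascii=False keeps the Cyrillic characters readable.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)

# Both forms carry the same data once parsed again.
assert json.loads(escaped) == json.loads(readable) == item
```

Either form is valid JSON; the setting only changes how the file looks when opened in an editor.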

Upvotes: 5

Alisher Gafurov

Reputation: 447

If I understand you correctly, you are confused because the value you got does not look like Cyrillic text. But everything is actually fine: you are getting the correct value. The string is a Unicode string, and its representation simply shows the non-ASCII characters as \uXXXX escape sequences. To see the readable Cyrillic value, you can do this:

# Python 2
title = u'\u0418\u043d\u0442\u0435\u0440\u0435\u0441\u043d\u044b\u0435 \u043f\u0443\u0431\u043b\u0438\u043a\u0430\u0446\u0438\u0438 / \u0425\u0430\u0431\u0440\u0430\u0445\u0430\u0431\u0440'
print(title.encode('utf-8'))


# Python 3
title = u'\u0418\u043d\u0442\u0435\u0440\u0435\u0441\u043d\u044b\u0435 \u043f\u0443\u0431\u043b\u0438\u043a\u0430\u0446\u0438\u0438 / \u0425\u0430\u0431\u0440\u0430\u0445\u0430\u0431\u0440'
print(title)

The result would be:

Интересные публикации / Хабрахабр
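
As a quick check that both outputs from the question hold the same text, here is a small Python 3 sketch. The variable names are only illustrative; the two literals are copied from the question.

```python
# Python 3
# The Unicode string from the first output in the question.
title = '\u0418\u043d\u0442\u0435\u0440\u0435\u0441\u043d\u044b\u0435 \u043f\u0443\u0431\u043b\u0438\u043a\u0430\u0446\u0438\u0438 / \u0425\u0430\u0431\u0440\u0430\u0445\u0430\u0431\u0440'

# The UTF-8 bytes from the second output (after .encode('utf-8')).
raw = (b'\xd0\x98\xd0\xbd\xd1\x82\xd0\xb5\xd1\x80\xd0\xb5\xd1\x81\xd0\xbd\xd1\x8b\xd0\xb5 '
       b'\xd0\xbf\xd1\x83\xd0\xb1\xd0\xbb\xd0\xb8\xd0\xba\xd0\xb0\xd1\x86\xd0\xb8\xd0\xb8 / '
       b'\xd0\xa5\xd0\xb0\xd0\xb1\xd1\x80\xd0\xb0\xd1\x85\xd0\xb0\xd0\xb1\xd1\x80')

# Decoding the bytes gives back exactly the same string.
assert raw.decode('utf-8') == title

# ascii() shows the escaped form of the string; print() shows the readable text.
print(ascii(title))
print(title)
```

So the escapes are only a display form of the string, and nothing is lost or corrupted in the scraped value.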

Upvotes: 1
