Reputation: 51
When I tried out Scrapy yesterday, I wanted to fetch the titles of the posts on a Chinese Ruby forum. But somehow the outputs of Scrapy are all Unicode escape sequences, like
"[\u5317\u4eac][2017\u5e746\u670818\u65e5] Rails Girls"
I've checked that the encoding of the response is UTF-8, and when I print the content of response.body it shows the Chinese characters correctly. So I'm confused: when I use a Scrapy selector to pick the title and write the output to a JSON file, why is the content of the file all escape sequences like \u5317? Any help will be appreciated. Thanks.
My code:
import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://ruby-china.org/topics']

    def parse(self, response):
        self.logger.warning("body: %s", response.body)
        for topic in response.css('div.topic'):
            title = topic.css('div.media-heading')
            yield {'title': title.css('a ::attr(title)').extract_first()}
Upvotes: 1
Views: 694
Reputation: 20748
When Scrapy calls your callback with a response for a URL, the response contains the decoded Unicode body content as response.text, and the "raw" bytes of the received body, in whatever encoding was used, as response.body.
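You can check this quickly in scrapy shell (a rough sketch, assuming Python 2.7, where the raw body shows up as str and the decoded text as unicode):

>>> type(response.body)    # raw bytes as received over the wire
<type 'str'>
>>> type(response.text)    # body decoded using the detected encoding
<type 'unicode'>
>>> response.encoding
'utf-8'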
When you call .extract() on the Scrapy selectors that you get from response.xpath() or response.css(), you get Python Unicode strings.
Python 2.7 uses \uXXXX escape sequences to represent them, and that's what you see in the console logs of the yielded items. But if you call print on those strings, you see the characters themselves:
$ scrapy shell https://ruby-china.org/topics
2017-05-23 13:15:33 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
(...)
2017-05-23 13:15:33 [scrapy.core.engine] INFO: Spider opened
2017-05-23 13:15:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ruby-china.org/topics> (referer: None)
(...)
>>> for topic in response.css('div.topic'):
... title = topic.css('div.media-heading')
... print(title.css('a ::attr(title)').extract_first())
...
[北京][2017年6月18日] Rails Girls 复活啦 2017 北京活动报名 | 少女们一天学编程
招 ruby 开发偏执狂,分享产品成果
challenge #1
[上海/成都] Le Wagon 编程训练营招聘 Ruby 导师,2200/ 天
量产型炮灰工程师
如果开发公众号内的小应用,rails 前端搭配哪个框架,vue?react?angular?
[长沙] Kdan Mobile 招聘 Ruby on Rails 工程师 (9K~15K)
Ruby 开发有什么新的进展吗?PHP 貌似要上 JIT 了!
这种需要强行增加对象阅读数,有其他建议吗?
rails 项目,production 模式在 ie8 下报"'undefined' 为空或不是对象"错误
pwc (sdc) 招后端,前端,区块链应用开发。
我想做个类似 app 中的消息中心,比如我下完订单,就会提示我订单的状态!
[上海] 郎客信息技术有限公司招聘 Rails 实习生 2 名
Rails 5.1 使用 yarn 和 webpack 实战 (vue, 构建等)
[上海] 赛若福诚聘 Ruby 工程师
[上海&杭州] Change 健身潮流文化社区招收 Ruby 工程师 (15-40k 十四薪)
[宁波] 新希望软件 Ruby 工程师 3 名 [8k~12k]
如何禁用下拉列表
為你自己學 Ruby on Rails
使用 RSpec 在 Rails 5 下测试邮件的发送
GitHub API v4 改用 GraphQL 了
[上海] 2017.5.21 Elixir Meetup
多态情况下关联表查询问题
Rails 与 Django 性能的疑问
[北京] 西单,金融方向,欢迎 Ruby 大牛 [15k~30k]
云梯正式开通 Telegram 官方频道
>>>
Now, if you export your items as JSON, for example with -o items.json, by default Scrapy will also write \uXXXX escape sequences in the JSON strings of the different items. It's the same way Python 2.7 represents non-ASCII characters, it's 100% valid JSON output, and it is actually the default for Python's json module (the ensure_ascii option).
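You can reproduce the same behaviour with the standard json module alone (a minimal illustration, not Scrapy-specific; the second print assumes a UTF-8 terminal):

# -*- coding: utf-8 -*-
import json

item = {'title': u'[北京][2017年6月18日] Rails Girls'}

# Default (ensure_ascii=True): non-ASCII characters are written as \uXXXX escapes
print(json.dumps(item))
# {"title": "[\u5317\u4eac][2017\u5e746\u670818\u65e5] Rails Girls"}

# With ensure_ascii=False, the characters are kept as-is
print(json.dumps(item, ensure_ascii=False))
# {"title": "[北京][2017年6月18日] Rails Girls"}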
If you really do need unescaped UTF-8 characters in the JSON output file, you can use Scrapy's FEED_EXPORT_ENCODING = 'utf-8' setting.
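For example, in your project's settings.py:

FEED_EXPORT_ENCODING = 'utf-8'

or directly on the command line with the -s option:

$ scrapy crawl myspider -o items.json -s FEED_EXPORT_ENCODING=utf-8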
Upvotes: 2