Reputation: 695
I'm trying to scrape from a non-English website using Scrapy. The scraped results as JSON look something like this:
{"price": "13,000", "name": "\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc"},
This is the code I'm using:
def parse(self, response):
    for sel in response.xpath('//section[@class="items-box"]'):
        item = ShopItem()
        # extract() returns a list, which has no .replace(); take the first match as a string
        item['name'] = sel.xpath('a/div/h3/text()').extract_first(default='')
        item['price'] = sel.xpath('a/div/div/div[1]/text()').extract_first(default='').replace("$", "")
        yield item
How can I write unescaped Unicode characters to the JSON output?
Upvotes: 5
Views: 5979
Reputation: 20748
Edit (2016-10-19):
With Scrapy 1.2+, you can use the FEED_EXPORT_ENCODING setting, set to the character encoding you need for the output JSON file, e.g. FEED_EXPORT_ENCODING = 'utf-8' (the default value being None, which means \uXXXX escaping).
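As a minimal sketch, the setting goes in your project's settings.py. The file location assumes a standard Scrapy project layout; only the setting name and value come from the text above.

```python
# settings.py of your Scrapy project (Scrapy 1.2+)
# Write feed exports as UTF-8 instead of \uXXXX escape sequences.
FEED_EXPORT_ENCODING = 'utf-8'
```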
Note: I'm adapting what I wrote on GitHub for a similar issue I linked to in the question's comments.
Note that there's an open issue on Scrapy to make the output encoding a parameter: https://github.com/scrapy/scrapy/issues/1965
Scrapy's default JSON exporter uses the (default) ensure_ascii=True argument, so it outputs Unicode characters as \uXXXX sequences before writing to file. (This is what is used when doing -o somefile.json.)
Setting ensure_ascii=False in the exporter will output Unicode strings, which will end up UTF-8 encoded in the file. See the custom exporter code at the bottom of this answer.
To illustrate, let's read your input JSON string back into some data to work on:
>>> import json
>>> test = r'''{"price": "13,000", "name": "\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc"}'''
>>> json.loads(test)
{u'price': u'13,000', u'name': u'\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc'}
The input with \uXXXX sequences is valid JSON for Python (as it should be), and loads() produces a valid Python dict.
Now let's serialize to JSON again:
>>> # dumping the dict back to JSON, with default ensure_ascii=True
>>> json.dumps(json.loads(test))
'{"price": "13,000", "name": "\\u58c1\\u6bb4\\u308a\\u4ee3\\u884c\\u69d8\\u5c02\\u7528\\u2605 \\u30c6\\u30ec\\u30d3\\u672c\\u4f53 20v\\u578b \\u767d \\u9001\\u6599\\u8fbc"}'
>>>
And now with ensure_ascii=False:
>>> # now dumping with ensure_ascii=False, you get a Unicode string
>>> json.dumps(json.loads(test), ensure_ascii=False)
u'{"price": "13,000", "name": "\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc"}'
>>>
Let's print to see the difference:
>>> print json.dumps(json.loads(test))
{"price": "13,000", "name": "\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc"}
>>> print json.dumps(json.loads(test), ensure_ascii=False)
{"price": "13,000", "name": "壁殴り代行様専用★ テレビ本体 20v型 白 送料込"}
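The same difference shows up on disk. The sketch below is written for Python 3 (the session above is Python 2) and uses a temporary file with a made-up name, items.json, to check that ensure_ascii=False combined with a UTF-8 file handle stores the real characters rather than escapes.

```python
import json
import os
import tempfile

item = {"price": "13,000", "name": "壁殴り代行様専用★"}

# Hypothetical output path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "items.json")

# ensure_ascii=False keeps the characters as-is; the file handle encodes them as UTF-8.
with open(path, "w", encoding="utf-8") as fh:
    json.dump(item, fh, ensure_ascii=False)

# Read the raw bytes back to inspect what was actually written.
with open(path, "rb") as fh:
    raw = fh.read()

print(raw.decode("utf-8"))
```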
If you want to write JSON items as UTF-8, you can do it like this:
1. Define a custom item exporter, e.g. in an exporters.py file in your project:
$ cat myproject/exporters.py
from scrapy.exporters import JsonItemExporter


class Utf8JsonItemExporter(JsonItemExporter):

    def __init__(self, file, **kwargs):
        super(Utf8JsonItemExporter, self).__init__(
            file, ensure_ascii=False, **kwargs)
2. Replace the default JSON item exporter in your settings.py:
FEED_EXPORTERS = {
    'json': 'myproject.exporters.Utf8JsonItemExporter',
}
Upvotes: 7
Reputation: 155487
Use the codecs module for text -> text decoding (in Python 2 it's not strictly necessary, but in Python 3 str doesn't have a decode method, because the methods are for str -> bytes and back, not str -> str). Using the unicode_escape codec for decoding will get you the correct data back:
import codecs
somestr = codecs.decode(strwithescapes, 'unicode-escape')
So to fix the names you're getting, you'd do:
item['name'] = codecs.decode(sel.xpath('a/div/h3/text()').extract_first(default=''), 'unicode-escape')
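A standalone sketch of that decoding step, assuming Python 3 and an input string that contains only ASCII characters plus literal \uXXXX escapes (the sample value is taken from the question's output):

```python
import codecs

# A raw string: the backslashes are literal characters, not yet interpreted.
escaped = r"\u58c1\u6bb4\u308a \u30c6\u30ec\u30d3"

# unicode-escape turns each literal \uXXXX sequence into the character it names.
decoded = codecs.decode(escaped, "unicode-escape")

print(len(escaped), len(decoded))
print(decoded)
```

If the input already contains real non-ASCII characters, this codec can mangle them, so it only suits text that is known to be escaped.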
If the problem is in JSON you're producing, you'd want to just make sure the json module isn't forcing strings to be ASCII with character escapes; it does so by default because not all JSON parsers can handle true Unicode characters (they often assume data is sent as ASCII bytes with escapes). So wherever you call json.dump/json.dumps (or create a json.JSONEncoder), make sure to explicitly pass ensure_ascii=False.
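For the json.JSONEncoder case, a short Python 3 sketch (the sample dict is illustrative only) showing that the flag goes on the encoder's constructor:

```python
import json

data = {"name": "テレビ本体"}

# The flag is set once on the encoder and applies to every encode() call.
encoder = json.JSONEncoder(ensure_ascii=False)
unescaped = encoder.encode(data)

# For comparison: the default behaviour escapes every non-ASCII character.
escaped = json.dumps(data)

print(unescaped)
print(escaped)
```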
Upvotes: 1