Reputation: 695

Not "\u": How to Unescape Unicode in JSON?

I'm trying to scrape from a non-English website using Scrapy. The scraped results as JSON look something like this:

{"price": "13,000", "name": "\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc"},

This is the code I'm using:

def parse(self, response):
    for sel in response.xpath('//section[@class="items-box"]'):
        item = ShopItem()
        # extract_first() returns a single string; extract() returns a list,
        # which has no .replace() method
        item['name'] = sel.xpath('a/div/h3/text()').extract_first()
        item['price'] = sel.xpath('a/div/div/div[1]/text()').extract_first(default='').replace("$", "")
        yield item

How would I output unescaped Unicode characters onto the JSON?

Upvotes: 5

Answers (2)

paul trmbrth

Reputation: 20748

Edit (2016-10-19):

With Scrapy 1.2+, you can set FEED_EXPORT_ENCODING to the character encoding you need for the output JSON file, e.g. FEED_EXPORT_ENCODING = 'utf-8' (the default value is None, which means \uXXXX escaping).
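As a minimal sketch, assuming a standard project layout with a settings.py file (the project name myproject is only illustrative), the setting would look like this:

```python
# myproject/settings.py (the project name is illustrative)
# Requires Scrapy 1.2 or later. Write feed exports as UTF-8 instead of
# the default \uXXXX-escaped output.
FEED_EXPORT_ENCODING = 'utf-8'
```

The same setting can also be passed for a single run with Scrapy's -s NAME=VALUE option, e.g. scrapy crawl myspider -o items.json -s FEED_EXPORT_ENCODING=utf-8 (myspider and items.json are placeholder names).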

Note: I'm adapting what I wrote on GitHub for a similar issue I linked to in the question's comments.

Note that there's an open issue on Scrapy to make the output encoding a parameter: https://github.com/scrapy/scrapy/issues/1965

Scrapy's default JSON exporter passes ensure_ascii=True (the json module's default), so it converts non-ASCII characters to \uXXXX escape sequences before writing to file. (This is the exporter used when you run with -o somefile.json.)

Setting ensure_ascii=False in the exporter will output Unicode strings, which will end up UTF-8 encoded in the file. See the custom exporter code at the bottom of this answer.

To illustrate, let's read your input JSON string back into some data to work on:

>>> import json
>>> test = r'''{"price": "13,000", "name": "\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc"}'''
>>> json.loads(test)
{u'price': u'13,000', u'name': u'\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc'}

The input with \uXXXX sequences is valid JSON for Python (as it should be), and loads() produces a valid Python dict.

Now let's serialize to JSON again:

>>> # dumping the dict back to JSON, with default ensure_ascii=True
>>> json.dumps(json.loads(test))
'{"price": "13,000", "name": "\\u58c1\\u6bb4\\u308a\\u4ee3\\u884c\\u69d8\\u5c02\\u7528\\u2605 \\u30c6\\u30ec\\u30d3\\u672c\\u4f53 20v\\u578b \\u767d \\u9001\\u6599\\u8fbc"}'
>>>

And now with ensure_ascii=False:

>>> # now dumping with ensure_ascii=False, you get a Unicode string
>>> json.dumps(json.loads(test), ensure_ascii=False)
u'{"price": "13,000", "name": "\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc"}'
>>>

Let's print to see the difference:

>>> print json.dumps(json.loads(test))
{"price": "13,000", "name": "\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc"}

>>> print json.dumps(json.loads(test), ensure_ascii=False)
{"price": "13,000", "name": "壁殴り代行様専用★ テレビ本体 20v型 白 送料込"}
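The session above is Python 2 (print statement, u'' prefixes). A rough Python 3 equivalent, sketched with only the standard json module (the variable names escaped and unescaped are mine), might look like this:

```python
import json

# The escaped JSON text from the question (raw string keeps the backslashes).
test = r'''{"price": "13,000", "name": "\u58c1\u6bb4\u308a\u4ee3\u884c\u69d8\u5c02\u7528\u2605 \u30c6\u30ec\u30d3\u672c\u4f53 20v\u578b \u767d \u9001\u6599\u8fbc"}'''

data = json.loads(test)

# Default ensure_ascii=True: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(data)

# ensure_ascii=False: non-ASCII characters are kept as-is in the result.
unescaped = json.dumps(data, ensure_ascii=False)

print(escaped)
print(unescaped)
```

Both strings parse back to the same dict; only the on-the-wire representation differs.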

If you want to write JSON items as UTF-8, you can do it like this:

1. Define a custom item exporter, e.g. in an exporters.py file in your project:

$ cat myproject/exporters.py 
from scrapy.exporters import JsonItemExporter


class Utf8JsonItemExporter(JsonItemExporter):

    def __init__(self, file, **kwargs):
        super(Utf8JsonItemExporter, self).__init__(
            file, ensure_ascii=False, **kwargs)

2. Replace the default JSON item exporter in your settings.py:

FEED_EXPORTERS = {
    'json': 'myproject.exporters.Utf8JsonItemExporter',
}

Upvotes: 7

ShadowRanger

Reputation: 155487

Use the codecs module for text -> text decoding (In Python 2 it's not strictly necessary, but in Python 3 str doesn't have a decode method, because the methods are for str -> bytes and back, not str -> str). Using the unicode_escape codec for decoding will get you the correct data back:

import codecs

somestr = codecs.decode(strwithescapes, 'unicode-escape')

So to fix the names you're getting, you'd do:

# extract() returns a list, so take a single string with extract_first()
item['name'] = codecs.decode(sel.xpath('a/div/h3/text()').extract_first(default=''), 'unicode-escape')
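To see the decoding step in isolation, here is a small Python 3 sketch. The sample value is part of the escaped name from the question, and the input is assumed to contain only ASCII characters plus literal \uXXXX escapes:

```python
import codecs

# A string holding literal backslash-u escapes (ASCII characters only).
strwithescapes = r'\u58c1\u6bb4\u308a \u30c6\u30ec\u30d3'

# Turn each \uXXXX escape into the character it stands for.
somestr = codecs.decode(strwithescapes, 'unicode_escape')
print(somestr)
```

One caveat: the unicode_escape codec treats its input as Latin-1, so a string that already contains real non-ASCII characters is likely to come out mangled. It is best reserved for input that is pure ASCII apart from the escapes.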

If the problem is in the JSON you're producing, just make sure the json module isn't forcing strings to be ASCII with escape sequences; it does so by default because not all JSON parsers can handle true Unicode characters (they often assume data is sent as ASCII bytes with escapes). So wherever you call json.dump/json.dumps (or create a json.JSONEncoder), explicitly pass ensure_ascii=False.
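As a sketch of that last point in Python 3, writing JSON to a file with ensure_ascii=False could look like the following. The file name items.json and the temporary directory are placeholders, and the sample item reuses part of the name from the question:

```python
import json
import os
import tempfile

# Sample data; the name is the start of the item name from the question.
item = {"price": "13,000", "name": "壁殴り代行様専用★"}

# items.json is a placeholder name, written to a temporary directory here.
path = os.path.join(tempfile.mkdtemp(), "items.json")

# Open the file as UTF-8 explicitly so the result does not depend on the
# platform's default encoding.
with open(path, "w", encoding="utf-8") as f:
    json.dump(item, f, ensure_ascii=False)

# Read the file back to inspect what was actually written.
with open(path, encoding="utf-8") as f:
    written = f.read()
print(written)
```

Opening the file with an explicit encoding matters here: with ensure_ascii=False the json module hands real Unicode characters to the file object, which then has to encode them.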

Upvotes: 1
