Maksim Khaitovich

Reputation: 4792

Python's UTF-8 encoding yields odd results even though explicit utf-8 encoding is used

I am parsing some JSON (specifically the Amazon reviews file, which Amazon provides publicly). I parse it line by line, converting to a Pandas DataFrame and inserting into SQL on the fly. I found something really odd. I open the JSON file with UTF-8 encoding, and when I open the file itself in Notepad I don't see any strange symbols. For example, a substring of a review:

The temperature control doesn’t hold to as tight a temperature as some of the others reported.

But when I parse it and check the contents of the string, I get:

The temperature control doesn\xe2\x80\x99t hold to as tight a temperature as some of the others reported. 

Why is that so? How can I read it properly?

My current code is below:

import io
import pandas as pd

def parseJSON(path):
    g = io.open(path, 'r', encoding='utf8')
    for l in g:
        yield eval(l)



for l in parseJSON(r"reviews.json"):
    for review in l["reviews"]:
        df = {}
        df[l["url"]] = review["review"]
        dfInsert = pd.DataFrame(list(df.items()), columns=["url", "Review"])

A subset of the file which fails is here: http://www.filedropper.com/subset

Upvotes: 1

Views: 454

Answers (1)

randomir

Reputation: 18697

First of all, you should never parse text from an unsafe (online) source with eval. If the data is JSON, you should use a JSON parser. That's why JSON was invented: to provide safe serialization and deserialization.
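Just to illustrate the risk, a single crafted line in the input would be executed rather than parsed (a contrived sketch, not taken from the reviews file; the output will differ on your machine):

>>> eval('__import__("os").getcwd()')   # the "data" runs arbitrary code instead of being returned as a string
'/home/user'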

In your case, use json.load() from the standard json module:

import io
import json

def parseJSON(path):
    return json.load(io.open(path, 'r', encoding='utf-8-sig'))

Since your JSON file contains a BOM, you should use the codec that knows how to strip it, i.e. the utf-8-sig.
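To see the difference, here is a minimal sketch with a made-up byte string (not your actual file):

>>> import codecs
>>> data = codecs.BOM_UTF8 + '{"reviews": []}'   # simulate a file that starts with a BOM
>>> data.decode('utf-8')[0]       # plain utf-8 leaves the BOM in the data
u'\ufeff'
>>> data.decode('utf-8-sig')[0]   # utf-8-sig strips it
u'{'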

If your file contains one JSON object per line, you can read it like this:

def parseJSON(path):
    with io.open(path, 'r', encoding='utf-8-sig') as f:
        for line in f:
            yield json.loads(line)

Now, to answer why you are seeing doesn\xe2\x80\x99t instead of doesn’t. If you decode the bytes \xe2\x80\x99 as UTF-8, you get:

>>> '\xe2\x80\x99'.decode('utf8')
u'\u2019'

and what Unicode codepoint is that?

>>> import unicodedata
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'

Ok, now what happens when you eval() it in Python 2? Well, first, note that Unicode is not really a first-class citizen in the land of Python 2 strings (Python 3 fixed that).

So, eval tries to parse the string (a series of bytes in Python 2) as a Python expression:

>>> eval('"’"')
'\xe2\x80\x99'

Note that (in my console, which uses UTF-8) even when I type ’, it is represented as a sequence of 3 bytes.

It doesn't even help to say it's supposed to be a unicode string:

>>> eval('u"’"')
u'\xe2\x80\x99'

What does help is to tell Python how to interpret the bytes that follow in the source/string, i.e. what the encoding is (see PEP 263):

>>> eval('# encoding: utf-8\nu"’"')
u'\u2019'
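For comparison, json.loads() does that decoding for you; with a minimal made-up line containing the same bytes you get a proper unicode string directly:

>>> import json
>>> json.loads('{"review": "doesn\xe2\x80\x99t"}')["review"]   # json decodes UTF-8 bytes to unicode
u'doesn\u2019t'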

Upvotes: 2
