Reputation: 4792
I am parsing some JSON (specifically the Amazon reviews file, which Amazon provides publicly). I parse it line by line, convert each line to a Pandas DataFrame, and insert it into SQL on the fly. I found something really odd. I open the JSON file with UTF-8 encoding, and when I open the file itself in Notepad I don't see any strange symbols. For example, a substring of one review:
The temperature control doesn’t hold to as tight a temperature as some of the others reported.
But when I parse it and check the contents of string:
The temperature control doesn\xe2\x80\x99t hold to as tight a temperature as some of the others reported.
Why is that? How can I read it properly?
My current code is below:
import io
import pandas as pd

def parseJSON(path):
    g = io.open(path, 'r', encoding='utf8')
    for l in g:
        yield eval(l)

for l in parseJSON(r"reviews.json"):
    for review in l["reviews"]:
        df = {}
        df[l["url"]] = review["review"]
        dfInsert = pd.DataFrame(list(df.items()), columns=["url", "Review"])
A file subset which fails is here: http://www.filedropper.com/subset
Upvotes: 1
Views: 454
Reputation: 18697
First of all, you should never parse text from an unsafe (online) source with eval. If the data is in JSON, you should use a JSON parser. That's why JSON was invented - to provide safe serialization and deserialization. In your case, use json.load() from the standard json module:
import io
import json

def parseJSON(path):
    return json.load(io.open(path, 'r', encoding='utf-8-sig'))
Since your JSON file contains a BOM, you should use the codec that knows how to strip it, i.e. utf-8-sig.
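To see what the BOM does, here is a quick sketch (assuming Python 2 and a byte string that starts with the UTF-8 BOM bytes \xef\xbb\xbf):

>>> raw = '\xef\xbb\xbf{"a": 1}'
>>> raw.decode('utf-8')        # plain utf-8 keeps the BOM as U+FEFF
u'\ufeff{"a": 1}'
>>> raw.decode('utf-8-sig')    # utf-8-sig strips it
u'{"a": 1}'

json.loads() would raise a ValueError on the first result because of the leading U+FEFF.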
If your file contains one JSON Object per line, you can read it like this:
def parseJSON(path):
    with io.open(path, 'r', encoding='utf-8-sig') as f:
        for line in f:
            yield json.loads(line)
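You can then plug this generator into your existing loop; a rough sketch, reusing the field names from your snippet (l["reviews"], review["review"], l["url"]):

import pandas as pd

for l in parseJSON(r"reviews.json"):
    for review in l["reviews"]:
        df = {l["url"]: review["review"]}
        dfInsert = pd.DataFrame(list(df.items()), columns=["url", "Review"])
        # dfInsert now holds properly decoded unicode text; write it to SQL
        # as before, e.g. with dfInsert.to_sql(...)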
Now to answer why you are seeing doesn\xe2\x80\x99t instead of doesn’t. If you decode the bytes \xe2\x80\x99 as UTF-8, you get:
>>> '\xe2\x80\x99'.decode('utf8')
u'\u2019'
and what Unicode codepoint is that?
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'
Ok, now what happens when you eval() it in Python 2? Well, first, note that Unicode is not really a first-class citizen in the land of Python 2 strings (Python 3 fixed that). So, eval tries to parse the string (a series of bytes in Python 2) as a Python expression:
>>> eval('"’"')
'\xe2\x80\x99'
Note that (in my console, which uses UTF-8) even when I type ’, it is represented as a sequence of 3 bytes.
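You can verify that in a Python 2 shell (assuming a UTF-8 terminal):

>>> len('’')    # a byte string: the terminal sent three UTF-8 bytes
3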
It doesn't even help to say it's supposed to be a unicode string:
>>> eval('u"’"')
u'\xe2\x80\x99'
What will help is to tell Python how to interpret the series of bytes that follow in the source/string, i.e. what the encoding is (see PEP 263):
>>> eval('# encoding: utf-8\nu"’"')
u'\u2019'
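For comparison, json.loads() does that decoding for you, without any encoding-declaration tricks; a quick sketch with the same bytes:

>>> import json
>>> json.loads('"doesn\xe2\x80\x99t"')
u'doesn\u2019t'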
Upvotes: 2