In Pandas UnicodeDecodeError Cannot decode Unicode Ascii in JSON file using pandas.read_json()

Question

Thanks for your help in advance. I am trying to read a JSON file into a pandas DataFrame and getting a cornucopia of unicode/ascii errors. Edit: The error appears to lie in the fact that the JSON file is multi-line, with each line its own JSON object.

With a data file that looks like:

"data.json" = 

{"_i":{"$o":"5b"},"c_id":"10","p_id":"10","c_c":2,"l_c":59,"u":{"n":"J","id":"1"},"c_t":"2010","m":"Hopefully 

EDIT: Actually."}
{"_i":{"$o":"5b"},"p_id":"10","c_id":"10","p_id":"10","c_c":0,"l_c":8,"u":{"n":"S","id":"1"},"c_t":"2010","m":"in-laws?"}

Edit: In response to a comment, the above is not code to be run; it is a sample of my data file, which is saved as a JSON file.

As this is a multiple-line file, per the linked question "Loading a file with more than one line of JSON into Python's Pandas", I tried to use:

import pandas
df = pandas.read_json('data.json', lines = True)

This gives the error:

    json = u'[' + u','.join(lines) + u']'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 436: ordinal not in range(128)

According to this issue highlighted on GitHub https://github.com/pandas-dev/pandas/issues/15132, this is because:

This can happen in Python 2.7 if the default encoding is set to ascii (check sys.getdefaultencoding()). StringIO will convert the input string to ascii when lines=True, resulting in a UnicodeDecodeError because of mixing utf-8 and ascii strings.

Their solution is to change the system encoding from ascii to utf-8; however, I understand that this is inadvisable (source: "Changing default encoding of Python?").

I also tried setting the encoding to both utf-8 and ascii within read_json(), but to no avail.

How can I successfully read this json file into a pandas DataFrame, preserving the multi-line structure?

Many thanks!

user3023715 · Accepted Answer

People are so cranky on here sometimes. OK, so Python 2.7 defaults to ascii encoding, and you can use the following lines to see that:

import sys

encoding = sys.getdefaultencoding()
print(encoding)  # the question reports 'ascii' here on Python 2.7
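To see why an ascii default trips on this file, here is a minimal sketch (not taken from pandas) that decodes a UTF-8 byte string with both codecs. The sample string and the try_decode helper are made up for illustration; the emoji is used only because its UTF-8 encoding starts with byte 0xf0, the same byte named in the error message.

```python
# Sample record containing one non-ASCII character, encoded as UTF-8 bytes.
# U+1F600 encodes to the four bytes f0 9f 98 80.
raw = u'{"m": "\U0001F600"}'.encode("utf-8")


def try_decode(data, codec):
    """Hypothetical helper: decode bytes, or report why decoding failed."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError as exc:
        return "failed: " + exc.reason


ascii_result = try_decode(raw, "ascii")  # ascii cannot represent byte 0xf0
utf8_result = try_decode(raw, "utf-8")   # utf-8 decodes the same bytes fine

print(ascii_result)
print(utf8_result == u'{"m": "\U0001F600"}')
```

The same bytes are fine as UTF-8 and only fail when something decodes them as ascii, which is what the implicit conversion does when byte strings and unicode strings get mixed.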

It appears that they made a fix for this in pandas by allowing you to set the encoding like:

import pandas as pd
pd.read_json(the_file, encoding=encoding)

Unfortunately, that line doesn't seem to work either.

So rather than depend on pandas, we can do it ourselves. All the "lines" option does is join the lines with commas and wrap the result in square brackets (i.e. [{},{},{}]).

First, read in the data and strip it:

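# path and theFile are placeholders for the directory and file name
with open(path + theFile, 'rb') as f:
    data = f.readlines()

# Strip trailing whitespace, including the newline, from each line
data = [line.rstrip() for line in data]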

Python reads the lines in with no encoding problems, because in binary mode they stay raw bytes and nothing tries to decode them. We can then use the same code as pandas to join the lines:

data_lines = "[" + ','.join(data) + "]"

Then read the lines into the parser like normal:

df = pd.read_json(data_lines)
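Putting those steps together, here is a rough sketch that decodes explicitly as UTF-8 instead of relying on the default encoding, so it should behave the same on Python 2.7 and Python 3. The json_lines_to_array name is hypothetical, the in-memory sample stands in for the real file, and it assumes every record sits on a single line.

```python
import io
import json


def json_lines_to_array(handle):
    """Hypothetical helper: turn one-object-per-line JSON text into a
    single JSON array string, i.e. [{...},{...}]."""
    # Strip surrounding whitespace and skip blank lines.
    lines = [line.strip() for line in handle if line.strip()]
    return u"[" + u",".join(lines) + u"]"


# In-memory stand-in for the data file. For a real file, something like
# io.open(path, "r", encoding="utf-8") yields already-decoded text lines.
sample = io.StringIO(u'{"m": "in-laws? \U0001F600"}\n\n{"m": "ok"}\n')

array_text = json_lines_to_array(sample)
records = json.loads(array_text)  # sanity check that the result is valid JSON
print(len(records))
```

The resulting string plays the role of data_lines in the pd.read_json call. As far as I know, recent pandas releases warn when handed a literal JSON string, so wrapping it in io.StringIO may be needed there.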

BTW, none of this is an issue in Python 3.
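If you would rather not build the array string at all, another option is to parse each line on its own with the standard json module. This is only a sketch under the assumption that every line is one complete JSON object; parse_json_lines is a made-up name, and the sample records are invented. A record whose text contains a raw line break (like the first one in the question's sample) would be reported with its line number rather than silently merged.

```python
import io
import json


def parse_json_lines(handle):
    """Hypothetical helper: parse one JSON object per line into a list of dicts."""
    records = []
    for number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            records.append(json.loads(line))
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            raise ValueError("line %d is not valid JSON: %s" % (number, exc))
    return records


# In-memory stand-in for a file opened with an explicit encoding,
# e.g. io.open(path, "r", encoding="utf-8").
sample = io.StringIO(
    u'{"c_id": "10", "m": "in-laws?"}\n'
    u'{"c_id": "11", "m": "caf\u00e9"}\n'
)
records = parse_json_lines(sample)
print(len(records))
```

The list of dicts can then be handed to pandas.DataFrame(records); nested fields such as "u" would stay as dict values in that case.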
