Olga Botvinnik

Reputation: 1644

pandas read json not working on MultiIndex

I'm trying to read a dataframe that was written with df.to_json() back in with pd.read_json, but I'm getting a ValueError. I think it may have to do with the fact that the index is a MultiIndex, but I'm not sure how to deal with that.

The original dataframe of 55k rows is called psi and I created test.json via:

psi.head().to_json('test.json')

Here is the output of print psi.head().to_string() if you want to use that.

When I do it on this small set of data (5 rows), I get a ValueError.

! wget --no-check-certificate https://gist.githubusercontent.com/olgabot/9897953/raw/c270d8cf1b736676783cc1372b4f8106810a14c5/test.json
import pandas as pd
pd.read_json('test.json')

Here's the full stack:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-1de2f0e65268> in <module>()
      1 get_ipython().system(u' wget https://gist.githubusercontent.com/olgabot/9897953/raw/c270d8cf1b736676783cc1372b4f8106810a14c5/test.json')
      2 import pandas as pd
----> 3 pd.read_json('test.json')

/home/obot/virtualenvs/envy/lib/python2.7/site-packages/pandas/io/json.pyc in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit)
    196         obj = FrameParser(json, orient, dtype, convert_axes, convert_dates,
    197                           keep_default_dates, numpy, precise_float,
--> 198                           date_unit).parse()
    199 
    200     if typ == 'series' or obj is None:

/home/obot/virtualenvs/envy/lib/python2.7/site-packages/pandas/io/json.pyc in parse(self)
    264 
    265         else:
--> 266             self._parse_no_numpy()
    267 
    268         if self.obj is None:

/home/obot/virtualenvs/envy/lib/python2.7/site-packages/pandas/io/json.pyc in _parse_no_numpy(self)
    481         if orient == "columns":
    482             self.obj = DataFrame(
--> 483                 loads(json, precise_float=self.precise_float), dtype=None)
    484         elif orient == "split":
    485             decoded = dict((str(k), v)

ValueError: No ':' found when decoding object value

> /home/obot/virtualenvs/envy/lib/python2.7/site-packages/pandas/io/json.py(483)_parse_no_numpy()
    482             self.obj = DataFrame(
--> 483                 loads(json, precise_float=self.precise_float), dtype=None)
    484         elif orient == "split":

But when I do it on the whole dataframe (55k rows), I get an invalid pointer error and the IPython kernel dies. Any ideas?

EDIT: added how the json was generated in the first place.

Upvotes: 12

Views: 9735

Answers (4)

David Bieber

Reputation: 33

This was my simple, dirty fix for encoding/decoding a MultiIndex pandas dataframe, which also seems to work for datetimes in the index/columns... not optimized!

Here is the encoder to JSON - I encode the dataframe, index and columns into a dict to create the JSON:

import json
import pandas as pd

def to_json_multiindex(df):
    # turn the index and columns into plain DataFrames so they survive to_json()
    dfi = df.index.to_frame()
    dfc = df.columns.to_frame()

    # bundle the data, index and columns as three separate JSON strings
    d = dict(
        df=df.to_json(),
        di=dfi.to_json(),
        dc=dfc.to_json(),
    )

    return json.dumps(d)

And here is the decoder, which reads the JSON dict and re-creates the dataframe:

def read_json_multiindex(j):
    d = json.loads(j)

    # rebuild the index: a MultiIndex if it had several levels, a named Index otherwise
    di = pd.read_json(d['di'])
    if di.shape[1] > 1:
        di = pd.MultiIndex.from_frame(di)
    else:
        _name = di.columns[0]
        di = di.index
        di.name = _name

    # same for the columns
    dc = pd.read_json(d['dc'])
    if dc.shape[1] > 1:
        dc = pd.MultiIndex.from_frame(dc)
    else:
        _name = dc.columns[0]
        dc = dc.index
        dc.name = _name

    # reattach the rebuilt index and columns to the raw values
    df = pd.read_json(d['df']).values

    return pd.DataFrame(
        data=df,
        index=di,
        columns=dc,
    )

And here is a test for MultiIndex columns and index... it seems to preserve the dataframe. A couple of issues: 1) it is probably inefficient, and 2) it doesn't seem to work for datetimes in a MultiIndex (but does work when the index isn't a MultiIndex).

df = pd.DataFrame(
    data=[[0, 1, 2], [2, 3, 4], [5, 6, 7]],
    index=pd.MultiIndex.from_tuples(
        (('aa', 'bb'), ('aa', 'cc'), ('bb', 'cc')),
        names=['AA', 'BB'],
    ),
    columns=pd.MultiIndex.from_tuples(
        (('XX', 'YY'), ('XX', 'ZZ'), ('YY', 'ZZ')),
        names=['YY', 'ZZ'],
    ),
)

j = to_json_multiindex(df)
d = read_json_multiindex(j)

Upvotes: 0

If you want to get the MultiIndex structure back:

# save the MultiIndex level names
indexes_names = df.index.names

df.reset_index().to_json('dump.json')

# restore the MultiIndex structure:
loaded_df = pd.read_json('dump.json').set_index(indexes_names)

Upvotes: -1

as - if

Reputation: 3297

Or you can just write the JSON with orient='table':

df.to_json(path_or_buf='test.json', orient='table')

and read the MultiIndex JSON back with:

pd.read_json('test.json', orient='table')
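
For example, a minimal round trip might look like the sketch below. It uses a small synthetic MultiIndex frame rather than the original psi data, and orient='table' requires a reasonably recent pandas:

import pandas as pd

# hypothetical example frame standing in for the question's data
df = pd.DataFrame(
    {'value': [1, 2, 3, 4]},
    index=pd.MultiIndex.from_product([['a', 'b'], ['x', 'y']], names=['outer', 'inner']),
)

# orient='table' writes the Table Schema alongside the data, including the index levels
df.to_json('test.json', orient='table')
roundtripped = pd.read_json('test.json', orient='table')

print(roundtripped.index.names)  # expected: ['outer', 'inner']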

Upvotes: 2

Jeff

Reputation: 128988

This is not implemented ATM, see the issue here: https://github.com/pydata/pandas/issues/4889.

You can simply reset the index first, e.g.

df.reset_index().to_json(...)

and it will work.
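
A sketch of reading such a dump back in and rebuilding the index by hand, assuming a hypothetical file name dump.json and index level names level_a/level_b (adjust to your actual names):

import pandas as pd

# the former MultiIndex levels come back as ordinary columns, so set them as the index again
flat = pd.read_json('dump.json')
restored = flat.set_index(['level_a', 'level_b'])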

Upvotes: 10
