Reading a json as Multiindex DataFrame

Question

I have a json which comprises like this.

[
    {
        "bi-document-identifier": "USA",
        "text": "I found somwthing Interesting",
        "relation_type": "ImportanceRelation",
        "annotation_type": "Opinionholder",
        "relation_start": {
            "annotatedtext": "We",
            "annotatedtext_represents": "OpinionHolder",
            "annotatedtext_startposition": 146,
            "annotatedtext_endposition": 148
        },
        "relation_end": {
            "annotatedtext": "information",
            "annotatedtext_represents": "StanceTarget",
            "annotatedtext_startposition": 239,
            "annotatedtext_endposition": 250
        }
    },
    {
        "bi-document-identifier": "USB",
        "text": "This is quite new to me",
        "relation_type": "ImportanceRelation",
        "annotation_type": "Opinionholder",
        "relation_start": {
            "annotatedtext": "Dr. A_PERSON",
            "annotatedtext_represents": "OpinionHolder",
            "annotatedtext_startposition": 0,
            "annotatedtext_endposition": 12
        },
        "relation_end": {
            "annotatedtext": "MOA of CV risk reduction",
            "annotatedtext_represents": "StanceTarget",
            "annotatedtext_startposition": 29,
            "annotatedtext_endposition": 53
        }
    }
]

how to read as a multi header dataframe. The orient and typ options in pandas.read_json() give me some errors like below

TypeError: list indices must be integers or slices, not str

Valdi_Bo · Accepted Answer

Start from read_json with records orientation:

df = pd.read_json('Input.json', orient='records')

Then convert both relation columns to "MultiIndexed" version:

d = { tt: df[tt].apply(pd.Series) for tt in ['relation_start', 'relation_end'] }
wrk1 = pd.concat(d, axis=1, keys=d.keys())

The added MultiIndex level is the top level (relation_start and relation_end) and index keys in source values become the second level.

The third step is to convert 4 initial columns, also to MultiIndex version, but this time the added level is the second level, containing only empty strings:

wrk2 = df.iloc[:, 0:4]
wrk2.columns = pd.MultiIndex.from_product([wrk2.columns, ['']])

And the last thing to do is to contatenate both wrk DataFrames (on the index):

result = wrk2.join(wrk1)

Or if you want a shorter code, don't create wrk1 (d is enough), and place corresponding expression in the final instruction:

result = wrk.join(pd.concat(d, axis=1, keys=d.keys()))

Note: json_normalize proposed in the other solution is undeniably shorter, but the result has plain (single level) column index. As I understood, you want just a MultiIndex on columns.

Reading a json as Multiindex DataFrame

Answers (2)

Related Questions