The Singularity
The Singularity

Reputation: 2698

Reading JSON (with nested arrays) in Pandas

I'm trying to load a JSON file message_1.json which has data structured as shown below:

{
  "participants": [
    {
      "name": "X"
    },
    {
      "name": "Y"
    }
  ],
  "messages": [
    {
      "sender_name": "X",
      "timestamp_ms": 1608148658236,
      "content": "Hi",
      "type": "Generic"
    },
    {
      "sender_name": "Y",
      "timestamp_ms": 1608148604578,
      "content": "Hey",
      "type": "Generic"
    },
    {
      "sender_name": "X",
      "timestamp_ms": 1608148599875,
      "content": "Bye",
      "type": "Generic"
    },

    ...
    ...
    ...

I have used df = pd.read_json("message_1.json") which results in ValueError: arrays must all be same length.

I believe a similar question has also been asked here and I've tried using the solution specified

with open('message_1.json') as json_data:
    data = json.load(json_data) # Data is successfully loaded here in the above format

pd.DataFrame.from_dict(data, orient='timestamp_ms').T.set_index('timestamp_ms') 

As I had no index column in the data, I set the index to timestamp_ms resulting in an error ValueError: only recognize index or columns for orient

I presume I've loaded the JSON in the wrong orientation.

Please Advise.

Upvotes: 1

Views: 288

Answers (1)

drops
drops

Reputation: 1604

From the json-input, it seems you only want the keep the "messages" part.

import json
import pandas as pd

with open('message_1.json') as json_data:
    data = json.load(json_data)

messages = data['messages']

# messages is a list, not a dict
df = pd.DataFrame(messages)

df.set_index('timestamp_ms', inplace=True)

output:

              sender_name content     type
timestamp_ms                              
1608148658236           X      Hi  Generic
1608148604578           Y     Hey  Generic
1608148599875           X     Bye  Generic

Upvotes: 1

Related Questions