chongzixin
chongzixin

Reputation: 1981

Reading nested json into dataframe

I have a json file exported from Firebase that looks like the below.

{
  "reports" : {
    "Google-Pixel 2 XL" : {
      "-MIoCtD9YUF2G9Esfrfz" : {
        "message" : "04 Oct 2020 23:25:17:047 onCreate MainActivity",
        "timestamp" : 1601825117067
      },
      "-MIoCtFVOxu8wdEHtm6q" : {
        "message" : "04 Oct 2020 23:25:17:214 onCreate Service",
        "timestamp" : 1601825117216
      },
      "-MIoCyBtKMQqQzUHEXsW" : {
        "message" : "04 Oct 2020 23:25:37:682 onStartCommand Service",
        "timestamp" : 1601825137685
      },
      "-MIoFWll9r3qwzWNoGMn" : {
        "message" : "04 Oct 2020 23:36:47:687: (1.3212517, 103.860314)",
        "timestamp" : 1601825807693
      }
    },
    "Vivo 1820" : {
      "-MIoF14JUm6JMZrOzDlL" : {
        "message" : "04 Oct 2020 23:34:37:623 onCreate MainActivity",
        "timestamp" : 1601825677653
      },
      "-MIoF1A9ZZNqTu5W-rQD" : {
        "message" : "04 Oct 2020 23:34:38:016 onCreate Service",
        "timestamp" : 1601825678026
      },
      "-MIoF2gNDua9FfLBTg6q" : {
        "message" : "04 Oct 2020 23:34:44:235 onCreate MainActivity",
        "timestamp" : 1601825684248
      }
    }
  }
}

I am trying to convert it into a 4 column dataframe that looks like the below

model             | id                   | message                               | timestamp
Google-Pixel 2 XL | -MIoCtD9YUF2G9Esfrfz | 04 Oct 2020 23:25:17:047 onCreate...  | 1601825117067
Vivo 1820         | -MIoF14JUm6JMZrOzDlL | 04 Oct 2020 23:34:37:623 onCreate...  | 1601825677653

How do I do that? I tried various ways and with normalizing but can't seem to get it.

data = pd.read_json("firebase-file.json")
df = pd.json_normalize(data, record_path="reports")

Thank you.

Upvotes: 2

Views: 297

Answers (3)

Henry Yik
Henry Yik

Reputation: 22493

I think you need to flatten it yourself, good thing it isn't complicated:

s = [[k, i, *j.values()] for k,v in data["reports"].items() for i, j in v.items()]

print (pd.DataFrame(s))

                   0                     1                                                  2              3
0  Google-Pixel 2 XL  -MIoCtD9YUF2G9Esfrfz     04 Oct 2020 23:25:17:047 onCreate MainActivity  1601825117067
1  Google-Pixel 2 XL  -MIoCtFVOxu8wdEHtm6q          04 Oct 2020 23:25:17:214 onCreate Service  1601825117216
2  Google-Pixel 2 XL  -MIoCyBtKMQqQzUHEXsW    04 Oct 2020 23:25:37:682 onStartCommand Service  1601825137685
3  Google-Pixel 2 XL  -MIoFWll9r3qwzWNoGMn  04 Oct 2020 23:36:47:687: (1.3212517, 103.860314)  1601825807693
4          Vivo 1820  -MIoF14JUm6JMZrOzDlL     04 Oct 2020 23:34:37:623 onCreate MainActivity  1601825677653
5          Vivo 1820  -MIoF1A9ZZNqTu5W-rQD          04 Oct 2020 23:34:38:016 onCreate Service  1601825678026
6          Vivo 1820  -MIoF2gNDua9FfLBTg6q     04 Oct 2020 23:34:44:235 onCreate MainActivity  1601825684248

Upvotes: 3

Bill Huang
Bill Huang

Reputation: 4648

According to the official docs of pd.json_normalize(), it assumes an array (list) input. However, the original json is far from something like list of dicts, and most importantly, the key "id" does not exist. Therefore I think a hand-made parser is definitely needed.

Code:

import pandas as pd
import json

file_path = "/mnt/ramdisk/in.json"

with open(file_path) as f:
    dic = json.load(f)

# discard the redundant "report" layer
dic = dic["reports"]

# produce a flattened list of dict
ls = []
for k1, v1 in dic.items():
    # k1 = model
    for k2, v2 in v1.items():
        # k2 = the hash-like id
        v2["model"] = k1
        v2["id"] = k2
        ls.append(v2)

df = pd.json_normalize(ls)

Output

# Trim the message for printing purpose
df2 = df.copy()
df2["message"] = df["message"].apply(lambda s: s[:10])
df2

Out[28]: 
      message      timestamp              model                    id
0  04 Oct 202  1601825117067  Google-Pixel 2 XL  -MIoCtD9YUF2G9Esfrfz
1  04 Oct 202  1601825117216  Google-Pixel 2 XL  -MIoCtFVOxu8wdEHtm6q
2  04 Oct 202  1601825137685  Google-Pixel 2 XL  -MIoCyBtKMQqQzUHEXsW
3  04 Oct 202  1601825807693  Google-Pixel 2 XL  -MIoFWll9r3qwzWNoGMn
4  04 Oct 202  1601825677653          Vivo 1820  -MIoF14JUm6JMZrOzDlL
5  04 Oct 202  1601825678026          Vivo 1820  -MIoF1A9ZZNqTu5W-rQD
6  04 Oct 202  1601825684248          Vivo 1820  -MIoF2gNDua9FfLBTg6q

N.B. Going into the layer as deep as where the hash-like id locates seems to be necessary. This is because the id's are keys originally, but it seems like they have to be reformatted as values in order to be correctly interpreted as values by pd.json_normalize. My simple survey on the Internet also found no example using a simple built-in method for parsing such a recursive structure.

Upvotes: 0

chrymxbrwn
chrymxbrwn

Reputation: 98

Try something this (see my comment above)

import pandas as pd


data = []
for k, v in test['reports'].items():
    model_name = k
    for model in v.items():
        _data = {}
        _data['model'] = model_name
        _data['id'] = model[0]
        _data['message'] = model[1]['message']
        _data['timestamp'] = model[1]['timestamp']
        data.append(_data)
df = pd.DataFrame(data)

Where test is your data, thus test['reports'] accesses the nested information you want to parse

Upvotes: 0

Related Questions