Reputation: 1981
I have a json
file exported from Firebase that looks like the below.
{
"reports" : {
"Google-Pixel 2 XL" : {
"-MIoCtD9YUF2G9Esfrfz" : {
"message" : "04 Oct 2020 23:25:17:047 onCreate MainActivity",
"timestamp" : 1601825117067
},
"-MIoCtFVOxu8wdEHtm6q" : {
"message" : "04 Oct 2020 23:25:17:214 onCreate Service",
"timestamp" : 1601825117216
},
"-MIoCyBtKMQqQzUHEXsW" : {
"message" : "04 Oct 2020 23:25:37:682 onStartCommand Service",
"timestamp" : 1601825137685
},
"-MIoFWll9r3qwzWNoGMn" : {
"message" : "04 Oct 2020 23:36:47:687: (1.3212517, 103.860314)",
"timestamp" : 1601825807693
}
},
"Vivo 1820" : {
"-MIoF14JUm6JMZrOzDlL" : {
"message" : "04 Oct 2020 23:34:37:623 onCreate MainActivity",
"timestamp" : 1601825677653
},
"-MIoF1A9ZZNqTu5W-rQD" : {
"message" : "04 Oct 2020 23:34:38:016 onCreate Service",
"timestamp" : 1601825678026
},
"-MIoF2gNDua9FfLBTg6q" : {
"message" : "04 Oct 2020 23:34:44:235 onCreate MainActivity",
"timestamp" : 1601825684248
}
}
}
}
I am trying to convert it into a 4 column dataframe that looks like the below
model | id | message | timestamp
Google-Pixel 2 XL | -MIoCtD9YUF2G9Esfrfz | 04 Oct 2020 23:25:17:047 onCreate... | 1601825117067
Vivo 1820 | -MIoF14JUm6JMZrOzDlL | 04 Oct 2020 23:34:37:623 onCreate... | 1601825677653
How do I do that? I tried various ways and with normalizing but can't seem to get it.
data = pd.read_json("firebase-file.json")
df = pd.json_normalize(data, record_path="reports")
Thank you.
Upvotes: 2
Views: 297
Reputation: 22493
I think you need to flatten it yourself, good thing it isn't complicated:
s = [[k, i, *j.values()] for k,v in data["reports"].items() for i, j in v.items()]
print (pd.DataFrame(s))
0 1 2 3
0 Google-Pixel 2 XL -MIoCtD9YUF2G9Esfrfz 04 Oct 2020 23:25:17:047 onCreate MainActivity 1601825117067
1 Google-Pixel 2 XL -MIoCtFVOxu8wdEHtm6q 04 Oct 2020 23:25:17:214 onCreate Service 1601825117216
2 Google-Pixel 2 XL -MIoCyBtKMQqQzUHEXsW 04 Oct 2020 23:25:37:682 onStartCommand Service 1601825137685
3 Google-Pixel 2 XL -MIoFWll9r3qwzWNoGMn 04 Oct 2020 23:36:47:687: (1.3212517, 103.860314) 1601825807693
4 Vivo 1820 -MIoF14JUm6JMZrOzDlL 04 Oct 2020 23:34:37:623 onCreate MainActivity 1601825677653
5 Vivo 1820 -MIoF1A9ZZNqTu5W-rQD 04 Oct 2020 23:34:38:016 onCreate Service 1601825678026
6 Vivo 1820 -MIoF2gNDua9FfLBTg6q 04 Oct 2020 23:34:44:235 onCreate MainActivity 1601825684248
Upvotes: 3
Reputation: 4648
According to the official docs of pd.json_normalize()
, it assumes an array (list) input. However, the original json is far from something like list of dicts, and most importantly, the key "id" does not exist. Therefore I think a hand-made parser is definitely needed.
Code:
import pandas as pd
import json
file_path = "/mnt/ramdisk/in.json"
with open(file_path) as f:
dic = json.load(f)
# discard the redundant "report" layer
dic = dic["reports"]
# produce a flattened list of dict
ls = []
for k1, v1 in dic.items():
# k1 = model
for k2, v2 in v1.items():
# k2 = the hash-like id
v2["model"] = k1
v2["id"] = k2
ls.append(v2)
df = pd.json_normalize(ls)
Output
# Trim the message for printing purpose
df2 = df.copy()
df2["message"] = df["message"].apply(lambda s: s[:10])
df2
Out[28]:
message timestamp model id
0 04 Oct 202 1601825117067 Google-Pixel 2 XL -MIoCtD9YUF2G9Esfrfz
1 04 Oct 202 1601825117216 Google-Pixel 2 XL -MIoCtFVOxu8wdEHtm6q
2 04 Oct 202 1601825137685 Google-Pixel 2 XL -MIoCyBtKMQqQzUHEXsW
3 04 Oct 202 1601825807693 Google-Pixel 2 XL -MIoFWll9r3qwzWNoGMn
4 04 Oct 202 1601825677653 Vivo 1820 -MIoF14JUm6JMZrOzDlL
5 04 Oct 202 1601825678026 Vivo 1820 -MIoF1A9ZZNqTu5W-rQD
6 04 Oct 202 1601825684248 Vivo 1820 -MIoF2gNDua9FfLBTg6q
N.B. Going into the layer as deep as where the hash-like id
locates seems to be necessary. This is because the id
's are keys
originally, but it seems like they have to be reformatted as values
in order to be correctly interpreted as values by pd.json_normalize
. My simple survey on the Internet also found no example using a simple built-in method for parsing such a recursive structure.
Upvotes: 0
Reputation: 98
Try something this (see my comment above)
import pandas as pd
data = []
for k, v in test['reports'].items():
model_name = k
for model in v.items():
_data = {}
_data['model'] = model_name
_data['id'] = model[0]
_data['message'] = model[1]['message']
_data['timestamp'] = model[1]['timestamp']
data.append(_data)
df = pd.DataFrame(data)
Where test
is your data, thus test['reports']
accesses the nested information you want to parse
Upvotes: 0