Reputation: 183
I want to do is load a json file of forex historical price data by Pandas and do statistic with the data. I have go through many topics on Pandas and parsing json file. I want to pass a json file with extra value and nested list to a pandas dataframe.
I got a json file 'EUR_JPY_H8.json'
First I import the lib that required,
import pandas as pd
import json
from pandas.io.json import json_normalize
Then load the json file,
with open('EUR_JPY_H8.json') as data_file:
data = json.load(data_file)
I got a list below:
[{u'complete': True,
u'mid': {u'c': u'119.743',
u'h': u'119.891',
u'l': u'119.249',
u'o': u'119.341'},
u'time': u'1488319200.000000000',
u'volume': 14651},
{u'complete': True,
u'mid': {u'c': u'119.893',
u'h': u'119.954',
u'l': u'119.552',
u'o': u'119.738'},
u'time': u'1488348000.000000000',
u'volume': 10738},
{u'complete': True,
u'mid': {u'c': u'119.946',
u'h': u'120.221',
u'l': u'119.840',
u'o': u'119.888'},
u'time': u'1488376800.000000000',
u'volume': 10041}]
Then I pass the list to json_normalize
. Try to get price which is in the nested list under 'mid'
result = json_normalize(data,'time',['time','volume','complete',['mid','h'],['mid','l'],['mid','c'],['mid','o']])
But I got such result,
The 'time' data got breakdown into each integer row by row.
I have checked related document. I have to pass a string or list object to the 2nd parameter of json_normalize
. How can I pass the timestamp there without breaking down?
The columns of my expected output are:
index | time | volumn | completed | mid.h | mid.l | mid.c | mid.o
Upvotes: 18
Views: 95007
Reputation: 23041
The data in the OP (after deserialized from a json string preferably using json.load()
) is a list of nested dictionaries, which is an ideal data structure for pd.json_normalize()
because it converts a list of dictionaries and flattens each dictionary into a single row. So the length of the list determines the number of rows and the total number of key-value pairs in the dictionaries determine the number of columns.
However, if a value under some key is a list, then that no longer is true because presumably the items in the those lists need to be in their separate rows. For example, if my_data.json
file is like:
# my_data.json
[
{'price': {'mid': ['119.743', '119.891', '119.341'], 'time': '123'}},
{'price': {'mid': ['119.893', '119.954', '119.552'], 'time': '456'}},
{'price': {'mid': ['119.946', '120.221', '119.840'], 'time': '789'}}
]
and then you'll want to put each value in the list as its own row. In that case, you can pass the path to these lists as record_path=
argument. Also, you can make each record have its accompanying metadata, whose path you can also pass as meta=
argument.
# deserialize json into a python data structure
import json
with open('my_data.json', 'r') as f:
data = json.load(f)
# normalize the python data structure
df = pd.json_normalize(data, record_path=['price', 'mid'], meta=[['price', 'time']], record_prefix='mid.')
Ultimately, pd.json_normalize()
cannot handle anything more complex than this kind of structure. For example, it cannot add another metadata to the above example if it's nested inside another dictionary. Depending on the data, you'll most probably need a recursive function to parse it (FYI, pd.json_normalize()
is a recursive function as well but it's for a general case and won't work for a lot of specific objects).
Often times, you'll need a combination of explode()
, pd.DataFrame(col.tolist())
etc. to completely parse the data.
Pandas also has a convenience function pd.read_json()
as well but it's even more limited than pd.json_normalize()
in that it can only correctly parse a json array of one nesting level. Unlike pd.json_normalize()
however, it deserializes a json string under the hood so you can directly pass the path to a json file to it (no need for json.load()
). In other words, the following two produce the same output:
df1 = pd.read_json("my_data.json")
df2 = pd.json_normalize(data, max_level=0) # here, `data` is deserialized `my_data.json`
df1.equals(df2) # True
Upvotes: 9
Reputation: 402273
You could just pass data
without any extra params.
df = pd.io.json.json_normalize(data)
df
complete mid.c mid.h mid.l mid.o time volume
0 True 119.743 119.891 119.249 119.341 1488319200.000000000 14651
1 True 119.893 119.954 119.552 119.738 1488348000.000000000 10738
2 True 119.946 120.221 119.840 119.888 1488376800.000000000 10041
If you want to change the column order, use df.reindex
:
df = df.reindex(columns=['time', 'volume', 'complete', 'mid.h', 'mid.l', 'mid.c', 'mid.o'])
df
time volume complete mid.h mid.l mid.c mid.o
0 1488319200.000000000 14651 True 119.891 119.249 119.743 119.341
1 1488348000.000000000 10738 True 119.954 119.552 119.893 119.738
2 1488376800.000000000 10041 True 120.221 119.840 119.946 119.888
Upvotes: 28