Reputation: 41
I'm trying to parse JSON into a pandas DataFrame.
Here is my json:
In [1]: data
Out[1]:
[
{
"field_1": {
"key1": value1,
"key2": value2
},
"field_2": {
"key1": value3,
"key2": value4
},
"data_A": [
{
"key1": value5,
"key2": value6
},
{
"key1": value7,
"key2": value8
}
],
"data_B": [
{
"key1": value9,
"key2": value10
}
]
},
{
"field_1": {
"key1": value11,
"key2": value12
},
"field_2": {
"key1": value13,
"key2": value14
},
"data_B": [
{
"key1": value15,
"key2": value16
},
{
"key1": value17,
"key2": value18
}
]
},
...
]
As you can see, each object in data
contains field_1 and field_2. It also contains data_A or data_B (at least one of the two); whichever of data_A and data_B is present is a list of one or more elements.
What I would like to have is the following result:
In [2]: df_A
Out[2]:
data_A.key1 data_A.key2 field_1.key1 field_1.key2 field_2.key1 field_2.key2
0 value5 value6 value1 value2 value3 value4
1 value7 value8 value1 value2 value3 value4
...
In [3]: df_B
Out[3]:
data_B.key1 data_B.key2 field_1.key1 field_1.key2 field_2.key1 field_2.key2
0 value9 value10 value1 value2 value3 value4
1 value15 value16 value11 value12 value13 value14
2 value17 value18 value11 value12 value13 value14
...
I can do it with for-loops, but is there a more efficient and elegant way?
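For reference, pandas' own json_normalize can build the df_A layout without an explicit Python loop over rows. This is a sketch, assuming pandas >= 1.0 and that the JSON has already been parsed into a list called records; df_B would be built the same way with "data_B" (the list comprehension only filters out records missing the key, since json_normalize raises a KeyError otherwise):

```python
import pandas as pd

# Stand-in for the parsed JSON from the question (json.loads(s))
records = [
    {
        "field_1": {"key1": "value1", "key2": "value2"},
        "field_2": {"key1": "value3", "key2": "value4"},
        "data_A": [
            {"key1": "value5", "key2": "value6"},
            {"key1": "value7", "key2": "value8"},
        ],
        "data_B": [{"key1": "value9", "key2": "value10"}],
    },
]

# One output row per data_A element; field_1/field_2 values are
# carried along as metadata and repeated for each row.
df_A = pd.json_normalize(
    [r for r in records if "data_A" in r],
    record_path="data_A",
    meta=[["field_1", "key1"], ["field_1", "key2"],
          ["field_2", "key1"], ["field_2", "key2"]],
    record_prefix="data_A.",
)
```

With the default separator ".", the columns come out as data_A.key1, data_A.key2, field_1.key1, and so on.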
Upvotes: 0
Views: 80
Reputation: 11
If you are on Linux or Mac with Python 3.6+, you can use a library (bamboo) that I wrote for wrangling nested data (e.g. JSON) into Pandas. If you are on Windows or using a different Python version, but are still interested in the library, let me know and I will see if I can build a more portable version.
You can install bamboo from pypi: pip install bamboo-nested
The conversion is done entirely in a C++ extension and numpy, so it should be very performant.
Here is how it would work with the JSON you gave (I made a few small changes to make it a valid JSON example):
s = '''
[
{
"field_1": {
"key1": "value1",
"key2": "value2"
},
"field_2": {
"key1": "value3",
"key2": "value4"
},
"data_A": [
{
"key1": "value5",
"key2": "value6"
},
{
"key1": "value7",
"key2": "value8"
}
],
"data_B": [
{
"key1": "value9",
"key2": "value10"
}
]
},
{
"field_1": {
"key1": "value11",
"key2": "value12"
},
"field_2": {
"key1": "value13",
"key2": "value14"
},
"data_B": [
{
"key1": "value15",
"key2": "value16"
},
{
"key1": "value17",
"key2": "value18"
}
]
}
]
'''
from bamboo import from_json
tree = from_json(s)
df_A = tree.flatten(exclude=[tree.data_B])  # pieces of the tree can also be referred to by string name, e.g. exclude=['data_B']
df_B = tree.flatten(exclude=[tree.data_A])
print(df_A)
print(df_B)
The output produced is:
data_A_key1 data_A_key2 field_1_key1 field_1_key2 field_2_key1 field_2_key2
0 value5 value6 value1 value2 value3 value4
1 value7 value8 value1 value2 value3 value4
data_B_key1 data_B_key2 field_1_key1 field_1_key2 field_2_key1 field_2_key2
0 value9 value10 value1 value2 value3 value4
1 value15 value16 value11 value12 value13 value14
2 value17 value18 value11 value12 value13 value14
This is not exactly what you requested, but very close: by default, column names are built with underscores as the separator. You can change the naming by passing a different "name_strategy" parameter to "flatten". There is currently no option for changing the default column-name separator, but I can add one.
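In the meantime, the dotted names from your expected output can be recovered with a plain-pandas rename after flattening. This is a sketch, assuming (as in your example) that every column name ends with the inner key name keyN:

```python
import pandas as pd

# Example frame with the default underscore-separated column names
df_A = pd.DataFrame([["value5", "value1"]],
                    columns=["data_A_key1", "field_1_key1"])

# Swap the last underscore (the one before the inner key) for a dot,
# e.g. "data_A_key1" -> "data_A.key1"
df_A.columns = df_A.columns.str.replace(r"_(key\d+)$", r".\1", regex=True)
```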
I hope this helps!
Upvotes: 1