aka
aka

Reputation: 41

Shaping json into pandas DataFrame in an elegant way (without for-loops)

Im trying to parse a JSON into a pandas DataFrame

Here is my json:

In [1]: data
Out[1]: 
[
  {
    "field_1": {
      "key1": value1,
      "key2": value2
    },
    "field_2": {
      "key1": value3,
      "key2": value4
    },
    "data_A": [
      {
        "key1": value5,
        "key2": value6
      },
      {
        "key1": value7,
        "key2": value8
      }
    ]
    "data_B": [
      {
        "key1": value9,
        "key2": value10
      }
    ]
  },
  {
    "field_1": {
      "key1": value11,
      "key2": value12
    },
    "field_2": {
      "key1": value13,
      "key2": value14
    },
    "data_B": [
      {
        "key1": value15,
        "key2": value16
      },
      {
        "key1": value17,
        "key2": value18
      }
    ]
  },
  ... 
]

As you can see, each of my object in data contains field_1 and field_2. It also contains either data_A or data_B (at least one of the two). data_A, when it exists, is a list one or more elements and data_B, when it exists, is a list of one or more elements.

What I would like to have is the following result:

In [2]: df_A
Out[2]: 
   data_A.key1  data_A.key2  field_1.key_1  field_1.key_2  field_2.key_1  field_2.key_2
0       value5       value6         value1         value2         value3         value4 
1       value7       value8         value1         value2         value3         value3
...

In [3]: df_B
Out[3]: 

   data_B.key1  data_B.key2  field_1.key_1  field_1.key_2  field_2.key_1  field_2.key_2
0       value9      value10         value1         value2         value3         value4
1      value15      value16        value11        value12        value13        value14
2      value17      value18        value11        value12        value13        value14
...

I can do it with for-loops, but is there a more efficient and elegant way?

Upvotes: 0

Views: 80

Answers (1)

Michael Vilim
Michael Vilim

Reputation: 11

If you are on Linux or Mac with Python 3.6+, you can use a library (bamboo) that I wrote for wrangling nested data (e.g. JSON) into Pandas. If you are on Windows or using a different Python version, but are still interested in the library, let me know and I will see if I can build a more portable version.

You can install bamboo from pypi: pip install bamboo-nested

The conversion is entirely done in a C++ extenion and numpy, so it should be very performant.

Here is how it would work with the JSON you gave (I made a few small changes to make it a valid JSON example):

s = '''
[
    {
        "field_1": {
            "key1": "value1",
            "key2": "value2"
        },
        "field_2": {
            "key1": "value3",
            "key2": "value4"
        },
        "data_A": [
            {
                "key1": "value5",
                "key2": "value6"
            },
            {
                "key1": "value7",
                "key2": "value8"
            }
        ],
        "data_B": [
            {
                "key1": "value9",
                "key2": "value10"
            }
        ]
    },
    {
        "field_1": {
            "key1": "value11",
            "key2": "value12"
        },
        "field_2": {
            "key1": "value13",
            "key2": "value14"
        },
        "data_B": [
            {
                "key1": "value15",
                "key2": "value16"
            },
            {
                "key1": "value17",
                "key2": "value18"
            }
        ]
    }
]
'''

from bamboo import from_json
tree = from_json(s)
df_A = tree.flatten(exclude=[tree.data_B])  # you can also refer to pieces of the tree with string names, i.e. "exclude=['data_B']"
df_B = tree.flatten(exclude=[tree.data_A])
print(df_A)
print(df_B)

The output produced is:

  data_A_key1 data_A_key2 field_1_key1 field_1_key2 field_2_key1 field_2_key2
0      value5      value6       value1       value2       value3       value4
1      value7      value8       value1       value2       value3       value4
  data_B_key1 data_B_key2 field_1_key1 field_1_key2 field_2_key1 field_2_key2
0      value9     value10       value1       value2       value3       value4
1     value15     value16      value11      value12      value13      value14
2     value17     value18      value11      value12      value13      value14

This is not exactly what you requested, but very close (it uses underscores by default for creating column names). You can change the column naming by passing another one into the "name_strategy" parameter of "flatten". I don't currently have an option for changing the default column name separator, but I can add that.

I hope this helps!

Upvotes: 1

Related Questions