konichiwa
konichiwa

Reputation: 551

Convert pandas hierarchical DataFrame to JSON

I have a pandas DataFrame that contains hierarchical elements from my database. parent_id columns tells me to what parent each child belongs. I would like to generate a nested JSON from this DataFrame (see section 3 for needed format). I tried many things, but I am mainly stuck on the _children list field in each parent, which should contain all its children. Any help is more than welcome, Thanks!

1. pandas DataFrame (WHAT I HAVE)

parent_id    child_id   level
125582       214659     1
125582       214633     1
125582       214263     1
214263       131673     2
214263       125579     2
214263       125578     2
214263       172670     2
214263       214266     2
214266       216675     3
214266       216671     3
172670       172669     3
172670       174777     3
172670       207661     3
207661       216529     4
207661       223884     4
223884       223885     5
223885       229186     6
229186       219062     7
229186       222243     7
and so on...

2. DataFrame logic

125582 (level 0 - root)
--214659 (level 1)
--214633 (level 1)
--214263 (level 1)
----131673 (level 2)
----125579 (level 2)
----125578 (level 2)
----172670 (level 2)
------172669 (level 3)
------174777 (level 3)
------207661 (level 3)
--------216529 (level 4)
--------223884 (level 4)
----------223885 (level 5)
------------229186 (level 6)
--------------219062 (level 7)    
--------------222243 (level 7)
----214266 (level 2)
------216675 (level 3)
------216671 (level 3)
and so on...

3. JSON (WHAT I NEED)

[
    {id: 125582, level: 0, "_children": [
        {id: 214659, level: 1},
        {id: 214633, level: 1},
        {id: 214263, level: 1, "_children": [
            {id: 131673, level: 2},
            {id: 125579, level: 2},
            {id: 125578, level: 2},
            {id: 172670, level: 2, "_children": [
                {id: 172669, level: 3},
                {id: 174777, level: 3},
                {id: 207661, level: 3, "_children": [
                    {id: 216529, level: 4},
                    {id: 223884, level: 4, "_children": [
                        {id: 223885, level: 5, "_children": [
                            {id: 229186, level: 6, "_children": [
                                {id: 219062, level: 7},
                                {id: 222243, level: 7}
                            ]}
                        ]}
                    ]}
                ]}
            ]},
            {id: 214266, level: 2, "_children": [
                {id: 216675, level: 3},
                {id: 216671, level: 3}
            ]}
        ]},
        and so on...
    ]}
]

Upvotes: 1

Views: 1729

Answers (3)

Andy L.
Andy L.

Reputation: 25249

This hierarchical json has parent->child relationship. Therefore, I suggest you use networkx package to construct a graph g from the dateframe. From graph g create the desired json.

You need to pip install networkx following this instruction

After that, do the following steps:

  • import networkx as nx
  • create the directed graph g using from_pandas_edgelist
  • create a dictionary d with unique key on parent_id and child_id with values being as level
  • assign d to nodes attributes level of graph g
  • finally, call nx.tree_data on g

The code as follows:

import networkx as nx

#create the directed graph `g` using `from_pandas_edgelist`
g = nx.from_pandas_edgelist(df, 'parent_id', 'child_id', ['level'], 
                            create_using=nx.DiGraph())

#create a dictionary with unique key on `parent_id` and `child_id`
df1 = df.melt('level')
d = dict(zip(df1.value, df1.level))
root_id = df.loc[0, 'parent_id']
d[root_id] = 0  #set level of root to `0`

#assign `d` to nodes attributes `level` of graph `g`
nx.set_node_attributes(g, d, 'level')    #add `level` values to nodes of `g`

out = [nx.tree_data(g, root_id, {'id': 'id', 'children': '_children'})]

Out[119]:
[{'level': 0,
  'id': 125582,
  '_children': [{'level': 1, 'id': 214659},
   {'level': 1, 'id': 214633},
   {'level': 1,
    'id': 214263,
    '_children': [{'level': 2, 'id': 131673},
     {'level': 2, 'id': 125579},
     {'level': 2, 'id': 125578},
     {'level': 2,
      'id': 172670,
      '_children': [{'level': 3, 'id': 172669},
       {'level': 3, 'id': 174777},
       {'level': 3,
        'id': 207661,
        '_children': [{'level': 4, 'id': 216529},
         {'level': 4,
          'id': 223884,
          '_children': [{'level': 5,
            'id': 223885,
            '_children': [{'level': 6,
              'id': 229186,
              '_children': [{'level': 7, 'id': 219062},
               {'level': 7, 'id': 222243}]}]}]}]}]},
     {'level': 2,
      'id': 214266,
      '_children': [{'level': 3, 'id': 216675},
       {'level': 3, 'id': 216671}]}]}]}]

Note: the output above is the dictionary inside of list as your desired output. My solution output puts the key level in front of key id, but the hierarchical/structural of sub-dictionaries is the same as your desired output


On the other hand, if you don't need the level values in the desired output, just directly use the built-in nx.tree_data function to return the the output as follows

import networkx as nx

g = nx.from_pandas_edgelist(df, 'parent_id', 'child_id', ['level'], 
                            create_using=nx.DiGraph())

root_id = df.loc[0, 'parent_id']    
out = [nx.tree_data(g, root_id, {'id': 'id', 'children': '_children'})]

Out[168]:
[{'id': 125582,
  '_children': [{'id': 214659},
   {'id': 214633},
   {'id': 214263,
    '_children': [{'id': 131673},
     {'id': 125579},
     {'id': 125578},
     {'id': 172670,
      '_children': [{'id': 172669},
       {'id': 174777},
       {'id': 207661,
        '_children': [{'id': 216529},
         {'id': 223884,
          '_children': [{'id': 223885,
            '_children': [{'id': 229186,
              '_children': [{'id': 219062}, {'id': 222243}]}]}]}]}]},
     {'id': 214266, '_children': [{'id': 216675}, {'id': 216671}]}]}]}]

Update:

Use this modified function of nx.tree_data to overcome the restriction on single parent per child. Add this customized function to your code and call it instead of nx.tree_data

from itertools import chain

def tree_data_custom(G, root):
    id_ = 'id'
    children = '_children'
    def add_children(n, G):
        nbrs = G[n]
        if len(nbrs) == 0:
            return []
        children_ = []
        for child in nbrs:
            d = dict(chain(G.nodes[child].items(), [(id_, child)]))
            c = add_children(child, G)
            if c:
                d[children] = c
            children_.append(d)
        return children_

    data = dict(chain(G.nodes[root].items(), [(id_, root)]))    
    data[children] = add_children(root, G)
    return data

Instead of call out = [nx.tree_data(g, root_id, {'id': 'id', 'children': '_children'})], just call tree_data_custom as follows

out =  [tree_data_custom(g, root_id)]

Update 2: Adding name column to dataframe

sample df where children has multiple parents

Out[258]:
    parent_id  child_id  level name
0      125582    214659      1   a1
1      125582    214633      1   a1
2      125582    214263      1   a1
3      214263    131673      2   a2
4      214263    125579      2   a2
5      214263    125578      2   a2
6      214263    172670      2   a2
7      214263    214266      2   a2
8      214266    216675      3   a3
9      214266    216671      3   a3
10     172670    172669      3   a3
11     172670    174777      3   a3
12     172670    207661      3   a3
13     207661    216529      4   a4
14     207661    223884      4   a4
15     223884    223885      5   a5
16     223885    229186      6   a6
17     229186    219062      7   a7
18     229186    222243      7   a7
19     222243    219187      8   a8
20     222243    245985      8   a8
21     222243    232393      8   a8
22     222243    247138      8   a8
23     222243    228848      8   a8
24     222243    228848      8   a8
25     222243    233920      8   a8
26     222243    233920      8   a8
27     222243    228113      8   a8
28     222243    233767      8   a8
29     222243    235407      8   a8
30     222243    237757      8   a8
31     222243    159091      8   a8
32     222243    159091      8   a8
33     222243    214832      8   a8
34     222243    253990      8   a8
35     222243    231610      8   a8
36     222243    231610      8   a8
37     222243    182323      8   a8
38     222243    242190      8   a8
39     222243    143580      8   a8
40     222243    242188      8   a8
41     222243    143581      8   a8
42     222243    242187      8   a8
43     222243    143582      8   a8
44     222243    242189      8   a8
45     222243    205877      8   a8
46     222243    242823      8   a8
47     222243    140979      8   a8
48     222243    237824      8   a8
49     222243    149933      8   a8
50     222243    149933      8   a8
51     222243    153625      8   a8
52     222243      8392      8   a8
53     222243    162085      8   a8
54     222243    162085      8   a8
55     222243    150691      8   a8
56     222243    147773      8   a8
57     222243    147773      8   a8
58     222243     61070      8   a8
59     222243     61070      8   a8
60     222243    204850      8   a8
61     222243    204850      8   a8
62      61070     46276      9   a9
63      61070     46276      9   a9
64      61070     46276      9   a9
65      61070     46276      9   a9
66     143580    159911      9   a9
67     143580     38958      9   a9
68     182323    159911      9   a9

The change is minimal, you only need to modify the step create dictionaries for nodes attributes of g. Currently, we only add attribute level to g. Now, we need to create another dictionary for name and add it to the new attribute name of nodes of g

#create the directed graph `g` using `from_pandas_edgelist`. 
#You don't need `[level]` in this step
g = nx.from_pandas_edgelist(df, 'parent_id', 'child_id', create_using=nx.DiGraph())

#create a dictionary with unique key on `parent_id` and `child_id`
#`melt` keep 2 columns 'level', 'name' instead of one column 'level'
#dictionary `d_level` for attribute `level` of `g's` nodes
#dictionary `d_name` for attribute `level` of `g's` nodes
df1 = df.melt(['level', 'name'])
d_level = dict(zip(df1.value, df1.level))
d_name = dict(zip(df1.value, df1.name))
root_id = df.loc[0, 'parent_id']
d_level[root_id] = 0  #set `level` of root to `0`

#assign `d_level` to nodes attributes `level` of graph `g` and `d_name` for `name`
nx.set_node_attributes(g, d_level, 'level')    #add `level` values to nodes of `g`
nx.set_node_attributes(g, d_name, 'name')    #add `name` values to nodes of `g`

#use customize `tree_data_custom` defined previously
out = [tree_data_custom(g, root_id)] 

If you adding many more columns to each row, it is better to create a single dictionary for all columns and apply one time to nodes of g as follows

g = nx.from_pandas_edgelist(df, 'parent_id', 'child_id', create_using=nx.DiGraph())

df1 = df.melt(['level', 'name'])
#this single dictionary to create both `level` and `name` attributes of nodes of `g`
d = {v: {'level': l, 'name': n} for v,l,n in zip(df1.value, df1.level, df1.name)}
root_id = df.loc[0, 'parent_id']
d[root_id]['level'] = 0  #set level of root to `0`

nx.set_node_attributes(g, d)    #a single add for both attributes `level`, `name`

out = [tree_data_custom(g, root_id)] 

Output

Out[260]:
[{'level': 0,
  'name': 'a1',
  'id': 125582,
  '_children': [{'level': 1, 'name': 'a1', 'id': 214659},
   {'level': 1, 'name': 'a1', 'id': 214633},
   {'level': 1,
    'name': 'a1',
    'id': 214263,
    '_children': [{'level': 2, 'name': 'a2', 'id': 131673},
     {'level': 2, 'name': 'a2', 'id': 125579},
     {'level': 2, 'name': 'a2', 'id': 125578},
     {'level': 2,
      'name': 'a2',
      'id': 172670,
      '_children': [{'level': 3, 'name': 'a3', 'id': 172669},
       {'level': 3, 'name': 'a3', 'id': 174777},
       {'level': 3,
        'name': 'a3',
        'id': 207661,
        '_children': [{'level': 4, 'name': 'a4', 'id': 216529},
         {'level': 4,
          'name': 'a4',
          'id': 223884,
          '_children': [{'level': 5,
            'name': 'a5',
            'id': 223885,
            '_children': [{'level': 6,
              'name': 'a6',
              'id': 229186,
              '_children': [{'level': 7, 'name': 'a7', 'id': 219062},
               {'level': 7,
                'name': 'a7',
                'id': 222243,
                '_children': [{'level': 8, 'name': 'a8', 'id': 219187},
                 {'level': 8, 'name': 'a8', 'id': 245985},
                 {'level': 8, 'name': 'a8', 'id': 232393},
                 {'level': 8, 'name': 'a8', 'id': 247138},
                 {'level': 8, 'name': 'a8', 'id': 228848},
                 {'level': 8, 'name': 'a8', 'id': 233920},
                 {'level': 8, 'name': 'a8', 'id': 228113},
                 {'level': 8, 'name': 'a8', 'id': 233767},
                 {'level': 8, 'name': 'a8', 'id': 235407},
                 {'level': 8, 'name': 'a8', 'id': 237757},
                 {'level': 8, 'name': 'a8', 'id': 159091},
                 {'level': 8, 'name': 'a8', 'id': 214832},
                 {'level': 8, 'name': 'a8', 'id': 253990},
                 {'level': 8, 'name': 'a8', 'id': 231610},
                 {'level': 8,
                  'name': 'a8',
                  'id': 182323,
                  '_children': [{'level': 9, 'name': 'a9', 'id': 159911}]},
                 {'level': 8, 'name': 'a8', 'id': 242190},
                 {'level': 8,
                  'name': 'a8',
                  'id': 143580,
                  '_children': [{'level': 9, 'name': 'a9', 'id': 159911},
                   {'level': 9, 'name': 'a9', 'id': 38958}]},
                 {'level': 8, 'name': 'a8', 'id': 242188},
                 {'level': 8, 'name': 'a8', 'id': 143581},
                 {'level': 8, 'name': 'a8', 'id': 242187},
                 {'level': 8, 'name': 'a8', 'id': 143582},
                 {'level': 8, 'name': 'a8', 'id': 242189},
                 {'level': 8, 'name': 'a8', 'id': 205877},
                 {'level': 8, 'name': 'a8', 'id': 242823},
                 {'level': 8, 'name': 'a8', 'id': 140979},
                 {'level': 8, 'name': 'a8', 'id': 237824},
                 {'level': 8, 'name': 'a8', 'id': 149933},
                 {'level': 8, 'name': 'a8', 'id': 153625},
                 {'level': 8, 'name': 'a8', 'id': 8392},
                 {'level': 8, 'name': 'a8', 'id': 162085},
                 {'level': 8, 'name': 'a8', 'id': 150691},
                 {'level': 8, 'name': 'a8', 'id': 147773},
                 {'level': 8,
                  'name': 'a8',
                  'id': 61070,
                  '_children': [{'level': 9, 'name': 'a9', 'id': 46276}]},
                 {'level': 8, 'name': 'a8', 'id': 204850}]}]}]}]}]}]},
     {'level': 2,
      'name': 'a2',
      'id': 214266,
      '_children': [{'level': 3, 'name': 'a3', 'id': 216675},
       {'level': 3, 'name': 'a3', 'id': 216671}]}]}]}]

Update 3: handle columns parent_name and child_name

You need to melt at once 4 columns parent_name, child_name, parent_id, parent_id. It is functionality of wide_to_long

g = nx.from_pandas_edgelist(df, 'parent_id', 'child_id', create_using=nx.DiGraph())

df1 = df.rename(lambda x: '_'.join(x.split('_')[::-1]), axis=1)
df1 = pd.wide_to_long(df1.reset_index(), stubnames=['id', 'name'],
                i='index', j='type', suffix='\w+', sep='_')
d = {v: {'level': l, 'name': n} for v,l,n in zip(df1.id, df1.level, df1.name)}
root_id = df.loc[0, 'parent_id']
d[root_id]['level'] = 0

nx.set_node_attributes(g, d)
out = [tree_data_custom(g, root_id)] 

Out[91]:
[{'level': 0,
  'name': 'word1',
  'id': 125582,
  '_children': [{'level': 1, 'name': 'word6', 'id': 214659},
   {'level': 1, 'name': 'word7', 'id': 214633},
   {'level': 1,
    'name': 'word2',
    'id': 214263,
    '_children': [{'level': 2, 'name': 'word8', 'id': 131673},
     {'level': 2, 'name': 'word9', 'id': 125579},
     {'level': 2, 'name': 'word10', 'id': 125578},
     {'level': 2,
      'name': 'word4',
      'id': 172670,
      '_children': [{'level': 3, 'name': 'word13', 'id': 172669},
       {'level': 3, 'name': 'word14', 'id': 174777},
       {'level': 3,
        'name': 'word5',
        'id': 207661,
        '_children': [{'level': 4, 'name': 'word15', 'id': 216529},
         {'level': 4,
          'name': 'word16',
          'id': 223884,
          '_children': [{'level': 5,
            'name': 'word17',
            'id': 223885,
            '_children': [{'level': 6,
              'name': 'word18',
              'id': 229186,
              '_children': [{'level': 7, 'name': 'word19', 'id': 219062},
               {'level': 7, 'name': 'word20', 'id': 222243}]}]}]}]}]},
     {'level': 2,
      'name': 'word3',
      'id': 214266,
      '_children': [{'level': 3, 'name': 'word11', 'id': 216675},
       {'level': 3, 'name': 'word12', 'id': 216671}]}]}]}]

Upvotes: 4

James Daily
James Daily

Reputation: 146

It's not a pandas one-liner, but here's a recursive function that should do the trick.

import json

import pandas as pd

df = pd.DataFrame(
    [{"parent_id": 125582, "child_id": 214659, "level": 1}, {"parent_id": 125582, "child_id": 214633, "level": 1},
     {"parent_id": 125582, "child_id": 214263, "level": 1}, {"parent_id": 214263, "child_id": 131673, "level": 2},
     {"parent_id": 214263, "child_id": 125579, "level": 2}, {"parent_id": 214263, "child_id": 125578, "level": 2},
     {"parent_id": 214263, "child_id": 172670, "level": 2}, {"parent_id": 214263, "child_id": 214266, "level": 2},
     {"parent_id": 214266, "child_id": 216675, "level": 3}, {"parent_id": 214266, "child_id": 216671, "level": 3},
     {"parent_id": 172670, "child_id": 172669, "level": 3}, {"parent_id": 172670, "child_id": 174777, "level": 3},
     {"parent_id": 172670, "child_id": 207661, "level": 3}, {"parent_id": 207661, "child_id": 216529, "level": 4},
     {"parent_id": 207661, "child_id": 223884, "level": 4}, {"parent_id": 223884, "child_id": 223885, "level": 5},
     {"parent_id": 223885, "child_id": 229186, "level": 6}, {"parent_id": 229186, "child_id": 219062, "level": 7},
     {"parent_id": 229186, "child_id": 222243, "level": 7}]
)


def recurse(parent_id, level):
    # create the base result
    result = {"id": int(parent_id), "level": int(level)}

    # get all of the children of this parent, one level below this one
    children = df[(df.parent_id == parent_id) & (df["level"] == level + 1)]

    # if there are no such children, then return without a _children key
    if children.empty:
        return result

    # otherwise, recurse on each child_id
    result["_children"] = [recurse(child_id, level + 1) for child_id in sorted(children.child_id.unique())]

    return result


# start at an implicit level 0
tree = [recurse(parent_id, 0) for parent_id in sorted(df[df["level"] == 1].parent_id.unique())]

# convert to JSON
print(json.dumps(tree, indent=3, sort_keys=True))

Upvotes: 2

phi
phi

Reputation: 11714

id_to_obj = {}
roots = []
def parse(row):
    pid, cid, lvl = row.parent_id, row.child_id, row.level
    if pid not in id_to_obj:
        # parent not created yet: create parent as root
        id_to_obj[pid] = {
            'id': pid, 'level': 0
        }
        roots.append(pid)

    # Get parent object
    parent = id_to_obj[pid]
    if not '_children' in parent:
        parent['_children'] = []

    # Create child object and add to parent's children list
    child = {
        'id': cid, 'level': lvl
    }
    parent['_children'].append(child)
    id_to_obj[cid] = child

df.sort_values(by='level').apply(parse, axis=1)

# Result
[id_to_obj[pid] for pid in roots]

Upvotes: 2

Related Questions