Reputation: 551
I have a pandas DataFrame
that contains hierarchical elements from my database. parent_id
columns tells me to what parent each child belongs.
I would like to generate a nested JSON from this DataFrame (see section 3 for needed format). I tried many things, but I am mainly stuck on the _children
list field in each parent, which should contain all its children. Any help is more than welcome, Thanks!
1. pandas DataFrame (WHAT I HAVE)
parent_id child_id level
125582 214659 1
125582 214633 1
125582 214263 1
214263 131673 2
214263 125579 2
214263 125578 2
214263 172670 2
214263 214266 2
214266 216675 3
214266 216671 3
172670 172669 3
172670 174777 3
172670 207661 3
207661 216529 4
207661 223884 4
223884 223885 5
223885 229186 6
229186 219062 7
229186 222243 7
and so on...
2. DataFrame logic
125582 (level 0 - root)
--214659 (level 1)
--214633 (level 1)
--214263 (level 1)
----131673 (level 2)
----125579 (level 2)
----125578 (level 2)
----172670 (level 2)
------172669 (level 3)
------174777 (level 3)
------207661 (level 3)
--------216529 (level 4)
--------223884 (level 4)
----------223885 (level 5)
------------229186 (level 6)
--------------219062 (level 7)
--------------222243 (level 7)
----214266 (level 2)
------216675 (level 3)
------216671 (level 3)
and so on...
3. JSON (WHAT I NEED)
[
{id: 125582, level: 0, "_children": [
{id: 214659, level: 1},
{id: 214633, level: 1},
{id: 214263, level: 1, "_children": [
{id: 131673, level: 2},
{id: 125579, level: 2},
{id: 125578, level: 2},
{id: 172670, level: 2, "_children": [
{id: 172669, level: 3},
{id: 174777, level: 3},
{id: 207661, level: 3, "_children": [
{id: 216529, level: 4},
{id: 223884, level: 4, "_children": [
{id: 223885, level: 5, "_children": [
{id: 229186, level: 6, "_children": [
{id: 219062, level: 7},
{id: 222243, level: 7}
]}
]}
]}
]}
]},
{id: 214266, level: 2, "_children": [
{id: 216675, level: 3},
{id: 216671, level: 3}
]}
]},
and so on...
]}
]
Upvotes: 1
Views: 1729
Reputation: 25249
This hierarchical json has parent->child relationship. Therefore, I suggest you use networkx
package to construct a graph g
from the dateframe. From graph g
create the desired json.
You need to pip
install networkx
following this instruction
After that, do the following steps:
g
using from_pandas_edgelist
d
with unique key on parent_id
and child_id
with values being as level
d
to nodes attributes level
of graph g
nx.tree_data
on g
The code as follows:
import networkx as nx
#create the directed graph `g` using `from_pandas_edgelist`
g = nx.from_pandas_edgelist(df, 'parent_id', 'child_id', ['level'],
create_using=nx.DiGraph())
#create a dictionary with unique key on `parent_id` and `child_id`
df1 = df.melt('level')
d = dict(zip(df1.value, df1.level))
root_id = df.loc[0, 'parent_id']
d[root_id] = 0 #set level of root to `0`
#assign `d` to nodes attributes `level` of graph `g`
nx.set_node_attributes(g, d, 'level') #add `level` values to nodes of `g`
out = [nx.tree_data(g, root_id, {'id': 'id', 'children': '_children'})]
Out[119]:
[{'level': 0,
'id': 125582,
'_children': [{'level': 1, 'id': 214659},
{'level': 1, 'id': 214633},
{'level': 1,
'id': 214263,
'_children': [{'level': 2, 'id': 131673},
{'level': 2, 'id': 125579},
{'level': 2, 'id': 125578},
{'level': 2,
'id': 172670,
'_children': [{'level': 3, 'id': 172669},
{'level': 3, 'id': 174777},
{'level': 3,
'id': 207661,
'_children': [{'level': 4, 'id': 216529},
{'level': 4,
'id': 223884,
'_children': [{'level': 5,
'id': 223885,
'_children': [{'level': 6,
'id': 229186,
'_children': [{'level': 7, 'id': 219062},
{'level': 7, 'id': 222243}]}]}]}]}]},
{'level': 2,
'id': 214266,
'_children': [{'level': 3, 'id': 216675},
{'level': 3, 'id': 216671}]}]}]}]
Note: the output above is the dictionary inside of list as your desired output. My solution output puts the key level
in front of key id
, but the hierarchical/structural of sub-dictionaries is the same as your desired output
On the other hand, if you don't need the level
values in the desired output, just directly use the built-in nx.tree_data
function to return the the output as follows
import networkx as nx
g = nx.from_pandas_edgelist(df, 'parent_id', 'child_id', ['level'],
create_using=nx.DiGraph())
root_id = df.loc[0, 'parent_id']
out = [nx.tree_data(g, root_id, {'id': 'id', 'children': '_children'})]
Out[168]:
[{'id': 125582,
'_children': [{'id': 214659},
{'id': 214633},
{'id': 214263,
'_children': [{'id': 131673},
{'id': 125579},
{'id': 125578},
{'id': 172670,
'_children': [{'id': 172669},
{'id': 174777},
{'id': 207661,
'_children': [{'id': 216529},
{'id': 223884,
'_children': [{'id': 223885,
'_children': [{'id': 229186,
'_children': [{'id': 219062}, {'id': 222243}]}]}]}]}]},
{'id': 214266, '_children': [{'id': 216675}, {'id': 216671}]}]}]}]
Update:
Use this modified function of nx.tree_data
to overcome the restriction on single parent per child. Add this customized function to your code and call it instead of nx.tree_data
from itertools import chain
def tree_data_custom(G, root):
id_ = 'id'
children = '_children'
def add_children(n, G):
nbrs = G[n]
if len(nbrs) == 0:
return []
children_ = []
for child in nbrs:
d = dict(chain(G.nodes[child].items(), [(id_, child)]))
c = add_children(child, G)
if c:
d[children] = c
children_.append(d)
return children_
data = dict(chain(G.nodes[root].items(), [(id_, root)]))
data[children] = add_children(root, G)
return data
Instead of call out = [nx.tree_data(g, root_id, {'id': 'id', 'children': '_children'})]
, just call tree_data_custom
as follows
out = [tree_data_custom(g, root_id)]
Update 2: Adding name
column to dataframe
sample df
where children has multiple parents
Out[258]:
parent_id child_id level name
0 125582 214659 1 a1
1 125582 214633 1 a1
2 125582 214263 1 a1
3 214263 131673 2 a2
4 214263 125579 2 a2
5 214263 125578 2 a2
6 214263 172670 2 a2
7 214263 214266 2 a2
8 214266 216675 3 a3
9 214266 216671 3 a3
10 172670 172669 3 a3
11 172670 174777 3 a3
12 172670 207661 3 a3
13 207661 216529 4 a4
14 207661 223884 4 a4
15 223884 223885 5 a5
16 223885 229186 6 a6
17 229186 219062 7 a7
18 229186 222243 7 a7
19 222243 219187 8 a8
20 222243 245985 8 a8
21 222243 232393 8 a8
22 222243 247138 8 a8
23 222243 228848 8 a8
24 222243 228848 8 a8
25 222243 233920 8 a8
26 222243 233920 8 a8
27 222243 228113 8 a8
28 222243 233767 8 a8
29 222243 235407 8 a8
30 222243 237757 8 a8
31 222243 159091 8 a8
32 222243 159091 8 a8
33 222243 214832 8 a8
34 222243 253990 8 a8
35 222243 231610 8 a8
36 222243 231610 8 a8
37 222243 182323 8 a8
38 222243 242190 8 a8
39 222243 143580 8 a8
40 222243 242188 8 a8
41 222243 143581 8 a8
42 222243 242187 8 a8
43 222243 143582 8 a8
44 222243 242189 8 a8
45 222243 205877 8 a8
46 222243 242823 8 a8
47 222243 140979 8 a8
48 222243 237824 8 a8
49 222243 149933 8 a8
50 222243 149933 8 a8
51 222243 153625 8 a8
52 222243 8392 8 a8
53 222243 162085 8 a8
54 222243 162085 8 a8
55 222243 150691 8 a8
56 222243 147773 8 a8
57 222243 147773 8 a8
58 222243 61070 8 a8
59 222243 61070 8 a8
60 222243 204850 8 a8
61 222243 204850 8 a8
62 61070 46276 9 a9
63 61070 46276 9 a9
64 61070 46276 9 a9
65 61070 46276 9 a9
66 143580 159911 9 a9
67 143580 38958 9 a9
68 182323 159911 9 a9
The change is minimal, you only need to modify the step create dictionaries for nodes attributes of g
. Currently, we only add attribute level
to g
. Now, we need to create another dictionary for name
and add it to the new attribute name
of nodes of g
#create the directed graph `g` using `from_pandas_edgelist`.
#You don't need `[level]` in this step
g = nx.from_pandas_edgelist(df, 'parent_id', 'child_id', create_using=nx.DiGraph())
#create a dictionary with unique key on `parent_id` and `child_id`
#`melt` keep 2 columns 'level', 'name' instead of one column 'level'
#dictionary `d_level` for attribute `level` of `g's` nodes
#dictionary `d_name` for attribute `level` of `g's` nodes
df1 = df.melt(['level', 'name'])
d_level = dict(zip(df1.value, df1.level))
d_name = dict(zip(df1.value, df1.name))
root_id = df.loc[0, 'parent_id']
d_level[root_id] = 0 #set `level` of root to `0`
#assign `d_level` to nodes attributes `level` of graph `g` and `d_name` for `name`
nx.set_node_attributes(g, d_level, 'level') #add `level` values to nodes of `g`
nx.set_node_attributes(g, d_name, 'name') #add `name` values to nodes of `g`
#use customize `tree_data_custom` defined previously
out = [tree_data_custom(g, root_id)]
If you adding many more columns to each row, it is better to create a single dictionary for all columns and apply one time to nodes of g
as follows
g = nx.from_pandas_edgelist(df, 'parent_id', 'child_id', create_using=nx.DiGraph())
df1 = df.melt(['level', 'name'])
#this single dictionary to create both `level` and `name` attributes of nodes of `g`
d = {v: {'level': l, 'name': n} for v,l,n in zip(df1.value, df1.level, df1.name)}
root_id = df.loc[0, 'parent_id']
d[root_id]['level'] = 0 #set level of root to `0`
nx.set_node_attributes(g, d) #a single add for both attributes `level`, `name`
out = [tree_data_custom(g, root_id)]
Output
Out[260]:
[{'level': 0,
'name': 'a1',
'id': 125582,
'_children': [{'level': 1, 'name': 'a1', 'id': 214659},
{'level': 1, 'name': 'a1', 'id': 214633},
{'level': 1,
'name': 'a1',
'id': 214263,
'_children': [{'level': 2, 'name': 'a2', 'id': 131673},
{'level': 2, 'name': 'a2', 'id': 125579},
{'level': 2, 'name': 'a2', 'id': 125578},
{'level': 2,
'name': 'a2',
'id': 172670,
'_children': [{'level': 3, 'name': 'a3', 'id': 172669},
{'level': 3, 'name': 'a3', 'id': 174777},
{'level': 3,
'name': 'a3',
'id': 207661,
'_children': [{'level': 4, 'name': 'a4', 'id': 216529},
{'level': 4,
'name': 'a4',
'id': 223884,
'_children': [{'level': 5,
'name': 'a5',
'id': 223885,
'_children': [{'level': 6,
'name': 'a6',
'id': 229186,
'_children': [{'level': 7, 'name': 'a7', 'id': 219062},
{'level': 7,
'name': 'a7',
'id': 222243,
'_children': [{'level': 8, 'name': 'a8', 'id': 219187},
{'level': 8, 'name': 'a8', 'id': 245985},
{'level': 8, 'name': 'a8', 'id': 232393},
{'level': 8, 'name': 'a8', 'id': 247138},
{'level': 8, 'name': 'a8', 'id': 228848},
{'level': 8, 'name': 'a8', 'id': 233920},
{'level': 8, 'name': 'a8', 'id': 228113},
{'level': 8, 'name': 'a8', 'id': 233767},
{'level': 8, 'name': 'a8', 'id': 235407},
{'level': 8, 'name': 'a8', 'id': 237757},
{'level': 8, 'name': 'a8', 'id': 159091},
{'level': 8, 'name': 'a8', 'id': 214832},
{'level': 8, 'name': 'a8', 'id': 253990},
{'level': 8, 'name': 'a8', 'id': 231610},
{'level': 8,
'name': 'a8',
'id': 182323,
'_children': [{'level': 9, 'name': 'a9', 'id': 159911}]},
{'level': 8, 'name': 'a8', 'id': 242190},
{'level': 8,
'name': 'a8',
'id': 143580,
'_children': [{'level': 9, 'name': 'a9', 'id': 159911},
{'level': 9, 'name': 'a9', 'id': 38958}]},
{'level': 8, 'name': 'a8', 'id': 242188},
{'level': 8, 'name': 'a8', 'id': 143581},
{'level': 8, 'name': 'a8', 'id': 242187},
{'level': 8, 'name': 'a8', 'id': 143582},
{'level': 8, 'name': 'a8', 'id': 242189},
{'level': 8, 'name': 'a8', 'id': 205877},
{'level': 8, 'name': 'a8', 'id': 242823},
{'level': 8, 'name': 'a8', 'id': 140979},
{'level': 8, 'name': 'a8', 'id': 237824},
{'level': 8, 'name': 'a8', 'id': 149933},
{'level': 8, 'name': 'a8', 'id': 153625},
{'level': 8, 'name': 'a8', 'id': 8392},
{'level': 8, 'name': 'a8', 'id': 162085},
{'level': 8, 'name': 'a8', 'id': 150691},
{'level': 8, 'name': 'a8', 'id': 147773},
{'level': 8,
'name': 'a8',
'id': 61070,
'_children': [{'level': 9, 'name': 'a9', 'id': 46276}]},
{'level': 8, 'name': 'a8', 'id': 204850}]}]}]}]}]}]},
{'level': 2,
'name': 'a2',
'id': 214266,
'_children': [{'level': 3, 'name': 'a3', 'id': 216675},
{'level': 3, 'name': 'a3', 'id': 216671}]}]}]}]
Update 3: handle columns parent_name
and child_name
You need to melt at once 4 columns parent_name
, child_name
, parent_id
, parent_id
. It is functionality of wide_to_long
g = nx.from_pandas_edgelist(df, 'parent_id', 'child_id', create_using=nx.DiGraph())
df1 = df.rename(lambda x: '_'.join(x.split('_')[::-1]), axis=1)
df1 = pd.wide_to_long(df1.reset_index(), stubnames=['id', 'name'],
i='index', j='type', suffix='\w+', sep='_')
d = {v: {'level': l, 'name': n} for v,l,n in zip(df1.id, df1.level, df1.name)}
root_id = df.loc[0, 'parent_id']
d[root_id]['level'] = 0
nx.set_node_attributes(g, d)
out = [tree_data_custom(g, root_id)]
Out[91]:
[{'level': 0,
'name': 'word1',
'id': 125582,
'_children': [{'level': 1, 'name': 'word6', 'id': 214659},
{'level': 1, 'name': 'word7', 'id': 214633},
{'level': 1,
'name': 'word2',
'id': 214263,
'_children': [{'level': 2, 'name': 'word8', 'id': 131673},
{'level': 2, 'name': 'word9', 'id': 125579},
{'level': 2, 'name': 'word10', 'id': 125578},
{'level': 2,
'name': 'word4',
'id': 172670,
'_children': [{'level': 3, 'name': 'word13', 'id': 172669},
{'level': 3, 'name': 'word14', 'id': 174777},
{'level': 3,
'name': 'word5',
'id': 207661,
'_children': [{'level': 4, 'name': 'word15', 'id': 216529},
{'level': 4,
'name': 'word16',
'id': 223884,
'_children': [{'level': 5,
'name': 'word17',
'id': 223885,
'_children': [{'level': 6,
'name': 'word18',
'id': 229186,
'_children': [{'level': 7, 'name': 'word19', 'id': 219062},
{'level': 7, 'name': 'word20', 'id': 222243}]}]}]}]}]},
{'level': 2,
'name': 'word3',
'id': 214266,
'_children': [{'level': 3, 'name': 'word11', 'id': 216675},
{'level': 3, 'name': 'word12', 'id': 216671}]}]}]}]
Upvotes: 4
Reputation: 146
It's not a pandas one-liner, but here's a recursive function that should do the trick.
import json
import pandas as pd
df = pd.DataFrame(
[{"parent_id": 125582, "child_id": 214659, "level": 1}, {"parent_id": 125582, "child_id": 214633, "level": 1},
{"parent_id": 125582, "child_id": 214263, "level": 1}, {"parent_id": 214263, "child_id": 131673, "level": 2},
{"parent_id": 214263, "child_id": 125579, "level": 2}, {"parent_id": 214263, "child_id": 125578, "level": 2},
{"parent_id": 214263, "child_id": 172670, "level": 2}, {"parent_id": 214263, "child_id": 214266, "level": 2},
{"parent_id": 214266, "child_id": 216675, "level": 3}, {"parent_id": 214266, "child_id": 216671, "level": 3},
{"parent_id": 172670, "child_id": 172669, "level": 3}, {"parent_id": 172670, "child_id": 174777, "level": 3},
{"parent_id": 172670, "child_id": 207661, "level": 3}, {"parent_id": 207661, "child_id": 216529, "level": 4},
{"parent_id": 207661, "child_id": 223884, "level": 4}, {"parent_id": 223884, "child_id": 223885, "level": 5},
{"parent_id": 223885, "child_id": 229186, "level": 6}, {"parent_id": 229186, "child_id": 219062, "level": 7},
{"parent_id": 229186, "child_id": 222243, "level": 7}]
)
def recurse(parent_id, level):
# create the base result
result = {"id": int(parent_id), "level": int(level)}
# get all of the children of this parent, one level below this one
children = df[(df.parent_id == parent_id) & (df["level"] == level + 1)]
# if there are no such children, then return without a _children key
if children.empty:
return result
# otherwise, recurse on each child_id
result["_children"] = [recurse(child_id, level + 1) for child_id in sorted(children.child_id.unique())]
return result
# start at an implicit level 0
tree = [recurse(parent_id, 0) for parent_id in sorted(df[df["level"] == 1].parent_id.unique())]
# convert to JSON
print(json.dumps(tree, indent=3, sort_keys=True))
Upvotes: 2
Reputation: 11714
id_to_obj = {}
roots = []
def parse(row):
pid, cid, lvl = row.parent_id, row.child_id, row.level
if pid not in id_to_obj:
# parent not created yet: create parent as root
id_to_obj[pid] = {
'id': pid, 'level': 0
}
roots.append(pid)
# Get parent object
parent = id_to_obj[pid]
if not '_children' in parent:
parent['_children'] = []
# Create child object and add to parent's children list
child = {
'id': cid, 'level': lvl
}
parent['_children'].append(child)
id_to_obj[cid] = child
df.sort_values(by='level').apply(parse, axis=1)
# Result
[id_to_obj[pid] for pid in roots]
Upvotes: 2