Reputation: 3318
I have the following dataframe created by
import pandas as pd
df = pd.DataFrame({'parent': ['AC1', 'AC2', 'AC3', 'AC1', 'AC11', 'AC5', 'AC5', 'AC6', 'AC8', 'AC9'],
'child': ['AC2', 'AC3', 'AC4', 'AC11', 'AC12', 'AC2', 'AC6', 'AC7', 'AC9', 'AC10']})
Which outputs the following:
parent child
0 AC1 AC2
1 AC2 AC3
2 AC3 AC4
3 AC1 AC11
4 AC11 AC12
5 AC5 AC2
6 AC5 AC6
7 AC6 AC7
8 AC8 AC9
9 AC9 AC10
I'd like to create a result dataframe where each of the parent (meaning it does not exist in the child column) is listed with the most final child(s).
df_result = pd.DataFrame({'parent': ['AC1', 'AC1', 'AC5', 'AC5', 'AC8', 'AC2'],
'child': ['AC4', 'AC12', 'AC4', 'AC7', 'AC10', 'AC4']})
parent child
0 AC1 AC4
1 AC1 AC12
2 AC5 AC4
3 AC5 AC7
4 AC8 AC10
5 AC2 AC4
I've started the following function but I'm not sure how to complete it.
def get_child(df):
result = {}
if df['parent'] not in df['child']:
return result[df['parent']]
Upvotes: 0
Views: 1008
Reputation: 56885
While your expected output seems somewhat incongruent with your description (AC2 seems like it shouldn't be considered a parent since it isn't a source node), I'm pretty confident that you want to run a traversal from each source node to locate all of its leaves. Doing this in the dataframe isn't convenient, so we can use df.values
and create an adjacency list dictionary to represent the graph. I assume there are no cycles in the graph.
import pandas as pd
from collections import defaultdict
def find_leaves(graph, src):
if src in graph:
for neighbor in graph[src]:
yield from find_leaves(graph, neighbor)
else:
yield src
def pair_sources_to_leaves(df):
graph = defaultdict(list)
children = set()
for parent, child in df.values:
graph[parent].append(child)
children.add(child)
leaves = [[x, list(find_leaves(graph, x))]
for x in graph if x not in children]
return (pd.DataFrame(leaves, columns=df.columns)
.explode(df.columns[-1])
.reset_index(drop=True))
if __name__ == "__main__":
df = pd.DataFrame({
"parent": ["AC1", "AC2", "AC3", "AC1", "AC11",
"AC5", "AC5", "AC6", "AC8", "AC9"],
"child": ["AC2", "AC3", "AC4", "AC11", "AC12",
"AC2", "AC6", "AC7", "AC9", "AC10"]
})
print(pair_sources_to_leaves(df))
Output:
parent child
0 AC1 AC4
1 AC1 AC12
2 AC5 AC4
3 AC5 AC7
4 AC8 AC10
Upvotes: 0
Reputation: 77837
This is a tree structure, a particular type of graph. A data frame is not a particularly convenient way to represent a tree; I recommend that you switch to networkx
or some other graph-based package. Then look up how to do a simple path traversal; you'll find direct support in the graph package documentation.
If you insist on doing this yourself -- which is a reasonable programming exercise -- you just need something like this pseudo-code
for each parent not in "child" column:
here = parent
while here in parent column:
here = here["child"]
record (parent, here) pair
Upvotes: 1