Bernardo
Bernardo

Reputation: 3318

Recursive function in pandas dataframe

I have the following dataframe created by

import pandas as pd    
df = pd.DataFrame({'parent': ['AC1', 'AC2', 'AC3', 'AC1', 'AC11', 'AC5', 'AC5', 'AC6', 'AC8', 'AC9'],
                   'child': ['AC2', 'AC3', 'AC4', 'AC11', 'AC12', 'AC2', 'AC6', 'AC7', 'AC9', 'AC10']})

Which outputs the following:

    parent  child
0   AC1     AC2
1   AC2     AC3
2   AC3     AC4
3   AC1     AC11
4   AC11    AC12
5   AC5     AC2
6   AC5     AC6
7   AC6     AC7
8   AC8     AC9
9   AC9     AC10

I'd like to create a result dataframe where each of the parent (meaning it does not exist in the child column) is listed with the most final child(s).

df_result = pd.DataFrame({'parent': ['AC1', 'AC1', 'AC5', 'AC5', 'AC8', 'AC2'],
                     'child': ['AC4', 'AC12', 'AC4', 'AC7', 'AC10', 'AC4']})
    parent  child
0   AC1     AC4
1   AC1     AC12
2   AC5     AC4
3   AC5     AC7
4   AC8     AC10
5   AC2     AC4

I've started the following function but I'm not sure how to complete it.

def get_child(df):
result = {}
if df['parent'] not in df['child']:
    return result[df['parent']]

Upvotes: 0

Views: 1008

Answers (2)

ggorlen
ggorlen

Reputation: 56885

While your expected output seems somewhat incongruent with your description (AC2 seems like it shouldn't be considered a parent since it isn't a source node), I'm pretty confident that you want to run a traversal from each source node to locate all of its leaves. Doing this in the dataframe isn't convenient, so we can use df.values and create an adjacency list dictionary to represent the graph. I assume there are no cycles in the graph.

import pandas as pd
from collections import defaultdict

def find_leaves(graph, src):
    if src in graph:
        for neighbor in graph[src]:
            yield from find_leaves(graph, neighbor)
    else:
        yield src

def pair_sources_to_leaves(df):
    graph = defaultdict(list)
    children = set()

    for parent, child in df.values:
        graph[parent].append(child)
        children.add(child)

    leaves = [[x, list(find_leaves(graph, x))] 
               for x in graph if x not in children]
    return (pd.DataFrame(leaves, columns=df.columns)
              .explode(df.columns[-1])
              .reset_index(drop=True))

if __name__ == "__main__":
    df = pd.DataFrame({
        "parent": ["AC1", "AC2", "AC3", "AC1", "AC11", 
                   "AC5", "AC5", "AC6", "AC8", "AC9"],
        "child": ["AC2", "AC3", "AC4", "AC11", "AC12", 
                  "AC2", "AC6", "AC7", "AC9", "AC10"]
    })
    print(pair_sources_to_leaves(df))

Output:

  parent child
0    AC1   AC4
1    AC1  AC12
2    AC5   AC4
3    AC5   AC7
4    AC8  AC10

Upvotes: 0

Prune
Prune

Reputation: 77837

This is a tree structure, a particular type of graph. A data frame is not a particularly convenient way to represent a tree; I recommend that you switch to networkx or some other graph-based package. Then look up how to do a simple path traversal; you'll find direct support in the graph package documentation.

If you insist on doing this yourself -- which is a reasonable programming exercise -- you just need something like this pseudo-code

for each parent not in "child" column:
    here = parent
    while here in parent column:
        here = here["child"]

    record (parent, here) pair

Upvotes: 1

Related Questions