Recursive function in pandas dataframe

Question

I have the following dataframe created by

import pandas as pd    
df = pd.DataFrame({'parent': ['AC1', 'AC2', 'AC3', 'AC1', 'AC11', 'AC5', 'AC5', 'AC6', 'AC8', 'AC9'],
                   'child': ['AC2', 'AC3', 'AC4', 'AC11', 'AC12', 'AC2', 'AC6', 'AC7', 'AC9', 'AC10']})

Which outputs the following:

    parent  child
0   AC1     AC2
1   AC2     AC3
2   AC3     AC4
3   AC1     AC11
4   AC11    AC12
5   AC5     AC2
6   AC5     AC6
7   AC6     AC7
8   AC8     AC9
9   AC9     AC10

I'd like to create a result dataframe where each of the parent (meaning it does not exist in the child column) is listed with the most final child(s).

df_result = pd.DataFrame({'parent': ['AC1', 'AC1', 'AC5', 'AC5', 'AC8', 'AC2'],
                     'child': ['AC4', 'AC12', 'AC4', 'AC7', 'AC10', 'AC4']})
    parent  child
0   AC1     AC4
1   AC1     AC12
2   AC5     AC4
3   AC5     AC7
4   AC8     AC10
5   AC2     AC4

I've started the following function but I'm not sure how to complete it.

def get_child(df):
result = {}
if df['parent'] not in df['child']:
    return result[df['parent']]

ggorlen · Accepted Answer

While your expected output seems somewhat incongruent with your description (AC2 seems like it shouldn't be considered a parent since it isn't a source node), I'm pretty confident that you want to run a traversal from each source node to locate all of its leaves. Doing this in the dataframe isn't convenient, so we can use df.values and create an adjacency list dictionary to represent the graph. I assume there are no cycles in the graph.

import pandas as pd
from collections import defaultdict

def find_leaves(graph, src):
    if src in graph:
        for neighbor in graph[src]:
            yield from find_leaves(graph, neighbor)
    else:
        yield src

def pair_sources_to_leaves(df):
    graph = defaultdict(list)
    children = set()

    for parent, child in df.values:
        graph[parent].append(child)
        children.add(child)

    leaves = [[x, list(find_leaves(graph, x))] 
               for x in graph if x not in children]
    return (pd.DataFrame(leaves, columns=df.columns)
              .explode(df.columns[-1])
              .reset_index(drop=True))

if __name__ == "__main__":
    df = pd.DataFrame({
        "parent": ["AC1", "AC2", "AC3", "AC1", "AC11", 
                   "AC5", "AC5", "AC6", "AC8", "AC9"],
        "child": ["AC2", "AC3", "AC4", "AC11", "AC12", 
                  "AC2", "AC6", "AC7", "AC9", "AC10"]
    })
    print(pair_sources_to_leaves(df))

Output:

  parent child
0    AC1   AC4
1    AC1  AC12
2    AC5   AC4
3    AC5   AC7
4    AC8  AC10

Recursive function in pandas dataframe

Answers (2)

Related Questions