Akshay Kannan
Akshay Kannan

Reputation: 1

How to optimize recursive function calls and inner loops in pandas?

I want to find all the parents of a particular child from a dataframe. My current code takes more than 20 seconds to compile for a 3000 datapoint dataset. I figured it is because of the recursive function calls and loops I have used. Can you help me optimise the program?

I've tried to search for the parent of the child node, print it and assume it as a child. Then recursively find it's parent and so on until all parents are found exhaustively.

df = pd.DataFrame(
    {
        'parent_name': 
    ["Car","Tyre","Tyre","Rubber","Nylon","Nylon","Trees","Trees"],
    'child_name': ["Tyre","Rubber","Nylon","Trees","Chemicals","Man-made","Leaves","Stems"]
    }
)

Define a function using all these to find all parent nodes

def get_parent_list(node_id):

    list_of_parents = []  

#define a function to find parent_names for all child_names   
    def find_parent(node_id):

       parent_names = df.loc[df["child_name"].isin([node_id]),"parent_name"]

       for parent_name in parent_names:
          list_of_parents.append(parent_name)
          find_parent(parent_name)

       find_parent(node_id)
       return list_of_parents

  df["list_of_parents"] = df["child_name"].apply(get_parent_list)

I would store the output received as a separate column on the dataframe

After this I would just do a search in the dataframe for user input and display the corresponding list of parents column as output

OutPut expected :

if user gives : "Trees" as input

output : Trees : Rubber, tyre, car

Upvotes: 0

Views: 375

Answers (1)

Bruno Degomme
Bruno Degomme

Reputation: 1143

Most natural here is to use a tree data structure, which will have linear query time. Though I'm surprised your approach was that slow, as 3000 datapoints is not huge.

import pandas as pd
from treelib import Tree

df = pd.DataFrame(
    {
        "parent_name":
            ["Car", "Tyre", "Tyre", "Rubber", "Nylon", "Nylon", "Trees", "Trees"],
        "child_name": ["Tyre", "Rubber", "Nylon", "Trees", "Chemicals", "Man-made", "Leaves", "Stems"]
    }
)

tree = Tree()
tree.create_node(df["parent_name"][0], df["parent_name"][0])  # root
for i, row in df.iterrows():
    tree.create_node(row["child_name"], row["child_name"], parent=row["parent_name"])
tree.show()

def find_parents(child_name):
    child = tree[child_name]
    parent_names = []
    while child.bpointer is not None:
        parent = tree[child.bpointer]
        parent_names.append(parent.identifier)
        child = parent

    return parent_names


print(find_parents("Trees"))
df["list_of_parents"] = df["child_name"].apply(find_parents)

Note : if you modify the dataframe you'll have to recreate the tree before calling the "find_parents" function again. If you modify the dataframe regularly you may choose to recreate the tree inside the find_parents function.

EDIT : Hi @AkshayKannan , sorry for the late response. Since some nodes may have multiple parents, the proper structure to use here is not a Tree but a Directed Acyclic Graph (DAG). The following should work (I added a row ("Nylon","Leaves") to test the multiple parent case)

import pandas as pd
import networkx as nx

df = pd.DataFrame(
    {
        "parent_name":
            ["Car", "Tyre", "Tyre", "Rubber", "Nylon", "Nylon", "Trees", "Trees", "Nylon"],
        "child_name": ["Tyre", "Rubber", "Nylon", "Trees", "Chemicals", "Man-made", "Leaves", "Stems", "Leaves"]
    }
)

G = nx.DiGraph()
for i, row in df.iterrows():
    G.add_edge(row["child_name"], row["parent_name"])

nx.draw(G, with_labels=True)


def find_parents(child_name):
    return list(nx.descendants(G, child_name))


print(find_parents("Car"))
print(find_parents("Chemicals"))
print(find_parents("Leaves"))

Upvotes: 1

Related Questions