Reputation: 1
I want to find all the parents of a particular child from a dataframe. My current code takes more than 20 seconds to compile for a 3000 datapoint dataset. I figured it is because of the recursive function calls and loops I have used. Can you help me optimise the program?
I've tried to search for the parent of the child node, print it and assume it as a child. Then recursively find it's parent and so on until all parents are found exhaustively.
df = pd.DataFrame(
{
'parent_name':
["Car","Tyre","Tyre","Rubber","Nylon","Nylon","Trees","Trees"],
'child_name': ["Tyre","Rubber","Nylon","Trees","Chemicals","Man-made","Leaves","Stems"]
}
)
def get_parent_list(node_id):
list_of_parents = []
#define a function to find parent_names for all child_names
def find_parent(node_id):
parent_names = df.loc[df["child_name"].isin([node_id]),"parent_name"]
for parent_name in parent_names:
list_of_parents.append(parent_name)
find_parent(parent_name)
find_parent(node_id)
return list_of_parents
df["list_of_parents"] = df["child_name"].apply(get_parent_list)
OutPut expected :
if user gives : "Trees" as input
output : Trees : Rubber, tyre, car
Upvotes: 0
Views: 375
Reputation: 1143
Most natural here is to use a tree data structure, which will have linear query time. Though I'm surprised your approach was that slow, as 3000 datapoints is not huge.
import pandas as pd
from treelib import Tree
df = pd.DataFrame(
{
"parent_name":
["Car", "Tyre", "Tyre", "Rubber", "Nylon", "Nylon", "Trees", "Trees"],
"child_name": ["Tyre", "Rubber", "Nylon", "Trees", "Chemicals", "Man-made", "Leaves", "Stems"]
}
)
tree = Tree()
tree.create_node(df["parent_name"][0], df["parent_name"][0]) # root
for i, row in df.iterrows():
tree.create_node(row["child_name"], row["child_name"], parent=row["parent_name"])
tree.show()
def find_parents(child_name):
child = tree[child_name]
parent_names = []
while child.bpointer is not None:
parent = tree[child.bpointer]
parent_names.append(parent.identifier)
child = parent
return parent_names
print(find_parents("Trees"))
df["list_of_parents"] = df["child_name"].apply(find_parents)
Note : if you modify the dataframe you'll have to recreate the tree before calling the "find_parents" function again. If you modify the dataframe regularly you may choose to recreate the tree inside the find_parents function.
EDIT : Hi @AkshayKannan , sorry for the late response. Since some nodes may have multiple parents, the proper structure to use here is not a Tree but a Directed Acyclic Graph (DAG). The following should work (I added a row ("Nylon","Leaves") to test the multiple parent case)
import pandas as pd
import networkx as nx
df = pd.DataFrame(
{
"parent_name":
["Car", "Tyre", "Tyre", "Rubber", "Nylon", "Nylon", "Trees", "Trees", "Nylon"],
"child_name": ["Tyre", "Rubber", "Nylon", "Trees", "Chemicals", "Man-made", "Leaves", "Stems", "Leaves"]
}
)
G = nx.DiGraph()
for i, row in df.iterrows():
G.add_edge(row["child_name"], row["parent_name"])
nx.draw(G, with_labels=True)
def find_parents(child_name):
return list(nx.descendants(G, child_name))
print(find_parents("Car"))
print(find_parents("Chemicals"))
print(find_parents("Leaves"))
Upvotes: 1