Reputation: 1952
I'm having trouble figuring out how to update the networkx dag_find_longest_path() algorithm to return "N" for ties instead of returning the first max edge found, or returning a list of all edges that are tied for max weight.
I first created a DAG from pandas dataframe that contains an edgelist like the following subset:
edge1 edge2 weight
115252161:T 115252162:A 1.0
115252162:A 115252163:G 1.0
115252163:G 115252164:C 3.0
115252164:C 115252165:A 5.5
115252165:A 115252166:C 5.5
115252162:T 115252163:G 1.0
115252166:C 115252167:A 7.5
115252167:A 115252168:A 7.5
115252168:A 115252169:A 6.5
115252165:A 115252166:G 0.5
Then I use the following code to topologically sort the graph and then find the longest path according to the weights of the edges:
G = nx.from_pandas_edgelist(edge_df, source="edge1",
target="edge2",
edge_attr=['weight'],
create_using=nx.OrderedDiGraph())
longest_path = pd.DataFrame(nx.dag_longest_path(G))
This works great, except when there are ties for max weighted edge, it returns the first max edge found, and instead I would like it to just return an "N" representing "Null". So currently, the output is:
115252161 T
115252162 A
115252163 G
115252164 C
115252165 A
115252166 C
But what I actually need is:
115252161 T
115252162 N (or [A,T] )
115252163 G
115252164 C
115252165 A
115252166 C
The algorithm for finding the longest path is:
def dag_longest_path(G):
dist = {} # stores [node, distance] pair
for node in nx.topological_sort(G):
# pairs of dist,node for all incoming edges
pairs = [(dist[v][0] + 1, v) for v in G.pred[node]]
if pairs:
dist[node] = max(pairs)
else:
dist[node] = (0, node)
node, (length, _) = max(dist.items(), key=lambda x: x[1])
path = []
while length > 0:
path.append(node)
length, node = dist[node]
return list(reversed(path))
Copy-pastable definition of G
.
import pandas as pd
import networkx as nx
import numpy as np
edge_df = pd.read_csv(
pd.compat.StringIO(
"""edge1 edge2 weight
115252161:T 115252162:A 1.0
115252162:A 115252163:G 1.0
115252163:G 115252164:C 3.0
115252164:C 115252165:A 5.5
115252165:A 115252166:C 5.5
115252162:T 115252163:G 1.0
115252166:C 115252167:A 7.5
115252167:A 115252168:A 7.5
115252168:A 115252169:A 6.5
115252165:A 115252166:G 0.5"""
),
sep=r" +",
)
G = nx.from_pandas_edgelist(
edge_df,
source="edge1",
target="edge2",
edge_attr=["weight"],
create_using=nx.OrderedDiGraph(),
)
longest_path = pd.DataFrame(nx.dag_longest_path(G))
Upvotes: 2
Views: 1651
Reputation: 1952
I ended up just modeling the behavior in a defaultdict counter object.
from collections import defaultdict, Counter
I modified my edgelist to a tuple of (position, nucleotide, weight):
test = [(112,"A",23.0), (113, "T", 27), (112, "T", 12.0), (113, "A", 27), (112,"A", 1.0)]
Then used defaultdict(counter) to quickly sum occurences of each nucleotide at each position:
nucs = defaultdict(Counter)
for key, nuc, weight in test:
nucs[key][nuc] += weight
And then looped through the dictionary to pull out all nucleotides that equal the max value:
for key, nuc in nucs.items():
seq_list = []
max_nuc = []
max_val = max(nuc.values())
for x, y in nuc.items():
if y == max_val:
max_nuc.append(x)
if len(max_nuc) != 1:
max_nuc = "N"
else:
max_nuc = ''.join(max_nuc)
seq_list.append(max_nuc)
sequence = ''.join(seq_list)
This returns the final sequence for the nucleotide with the max value found, and returns N in the position of a tie:
TNGCACAAATGCTGAAAGCTGTACCATANCTGTCTGGTCTTGGCTGAGGTTTCAATGAATGGAATCCCGTAACTCTTGGCCAGTTCGTGGGCTTGTTTTGTATCAACTGTCCTTGTTGGCAAATCACACTTGTTTCCCACTAGCACCAT
However, the question bothered me, so I ended up using node attributes in networkx as a means to flag each node as being a tie or not. Now, when a node is returned in the longest path, I can then check the "tie" attribute and replace the node name with "N" if it has been flagged:
def get_path(self, edge_df):
G = nx.from_pandas_edgelist(
edge_df,
source="edge1",
target="edge2",
edge_attr=["weight"],
create_using=nx.OrderedDiGraph()
)
# flag all nodes as not having a tie
nx.set_node_attributes(G, "tie", False)
# check nodes for ties
for node in G.nodes:
# create list of all edges for each node
edges = G.in_edges(node, data=True)
# if there are multiple edges
if len(edges) > 1:
# find max weight
max_weight = max([edge[2]['weight'] for edge in edges])
tie_check = []
for edge in edges:
# pull out all edges that match max weight
if edge[2]["weight"] == max_weight:
tie_check.append(edge)
# check if there are ties
if len(tie_check) > 1:
for x in tie_check:
# flag node as being a tie
G.node[x[0]]["tie"] = True
# return longest path
longest_path = nx.dag_longest_path(G)
return longest_path
Upvotes: 1
Reputation: 11602
Judging by your example, each node is determined by position ID (number before :
) and two nodes with different bases attached are identical for the purposes of computing the path length. If this is correct, there is no need to modify the algorithm and you could get your result by manipulating vertex labels.
Basically, drop everything after semicolon in the edge_df
, compute the longest path and append the base labels from your original data.
edge_df_pos = pd.DataFrame(
{
"edge1": edge_df.edge1.str.partition(":")[0],
"edge2": edge_df.edge2.str.partition(":")[0],
"weight": edge_df.weight,
}
)
vert_labels = dict()
for col in ("edge1", "edge2"):
verts, lbls = edge_df[col].str.partition(":")[[0, 2]].values.T
for vert, lbl in zip(verts, lbls):
vert_labels.setdefault(vert, set()).add(lbl)
G_pos = nx.from_pandas_edgelist(
edge_df_pos,
source="edge1",
target="edge2",
edge_attr=["weight"],
create_using=nx.OrderedDiGraph(),
)
longest_path_pos = nx.dag_longest_path(G_pos)
longest_path_df = pd.DataFrame([[node, vert_labels[node]] for node in longest_path_pos])
# 0 1
# 0 115252161 {T}
# 1 115252162 {A, T}
# 2 115252163 {G}
# 3 115252164 {C}
# 4 115252165 {A}
# 5 115252166 {G, C}
# 6 115252167 {A}
# 7 115252168 {A}
# 8 115252169 {A}
If my interpretation is not correct, I doubt that there is a simple extension of the algorithm based on topological sort. The problem is that a graph can admit multiple topological sorts. If you print dist
as defined in the dag_longest_path
in your example, you'll get something like this:
{'115252161:T': (0, '115252161:T'),
'115252162:A': (1, '115252161:T'),
'115252162:T': (0, '115252162:T'),
'115252163:G': (2, '115252162:A'),
'115252164:C': (3, '115252163:G'),
'115252165:A': (4, '115252164:C'),
'115252166:C': (5, '115252165:A'),
'115252166:G': (5, '115252165:A'),
'115252167:A': (6, '115252166:C'),
'115252168:A': (7, '115252167:A'),
'115252169:A': (8, '115252168:A')}
Note that '115252162:T'
occurs in the third line and nowhere else. Hence, dist
cannot discriminate between your example and another example where '115252162:T'
occurs as a disjoint component. So it should not be possible to recover any paths through '115252162:T'
just using data in dist
.
Upvotes: 1
Reputation: 321
This line inside the function seems to discard the paths you want; because max
only returns one result:
node, (length, _) = max(dist.items(), key=lambda x: x[1])
I would keep the max value and then search based on it all the items. Then reuse the code to find the desired paths. An example would be like this:
def dag_longest_path(G):
dist = {} # stores [node, distance] pair
for node in nx.topological_sort(G):
# pairs of dist,node for all incoming edges
pairs = [(dist[v][0] + 1, v) for v in G.pred[node]]
if pairs:
dist[node] = max(pairs)
else:
dist[node] = (0, node)
# store max value inside val variable
node, (length, val) = max(dist.items(), key=lambda x: x[1])
# find all dictionary items that have the maximum value
nodes = [(item[0], item[1][0]) for item in dist.items() if item[1][1] == val]
paths = []
# iterate over the different nodes and append the paths to a list
for node, length in nodes:
path = []
while length > 0:
path.append(node)
length, node = dist[node]
paths.append(list(reversed(path)))
return paths
PS. I haven't tested this code to know if it runs correctly.
Upvotes: 1