python-3.xnetworkxdirected-acyclic-graphslongest-path

Reputation: 1952

NetworkX: Find longest path in DAG returning all ties for max

I'm having trouble figuring out how to update the networkx dag_find_longest_path() algorithm to return "N" for ties instead of returning the first max edge found, or returning a list of all edges that are tied for max weight.

I first created a DAG from pandas dataframe that contains an edgelist like the following subset:

edge1        edge2          weight
115252161:T  115252162:A     1.0
115252162:A  115252163:G     1.0
115252163:G  115252164:C     3.0
115252164:C  115252165:A     5.5
115252165:A  115252166:C     5.5
115252162:T  115252163:G     1.0
115252166:C  115252167:A     7.5
115252167:A  115252168:A     7.5
115252168:A  115252169:A     6.5
115252165:A  115252166:G     0.5

Then I use the following code to topologically sort the graph and then find the longest path according to the weights of the edges:

 G = nx.from_pandas_edgelist(edge_df, source="edge1", 
                                target="edge2", 
                                edge_attr=['weight'], 
                                create_using=nx.OrderedDiGraph())

longest_path = pd.DataFrame(nx.dag_longest_path(G))

This works great, except when there are ties for max weighted edge, it returns the first max edge found, and instead I would like it to just return an "N" representing "Null". So currently, the output is:

115252161   T
115252162   A
115252163   G
115252164   C
115252165   A
115252166   C

But what I actually need is:

115252161   T
115252162   N (or [A,T] )
115252163   G
115252164   C
115252165   A
115252166   C

The algorithm for finding the longest path is:

def dag_longest_path(G):

    dist = {}  # stores [node, distance] pair
    for node in nx.topological_sort(G):
        # pairs of dist,node for all incoming edges
        pairs = [(dist[v][0] + 1, v) for v in G.pred[node]]
        if pairs:
            dist[node] = max(pairs)
        else:
            dist[node] = (0, node)
    node, (length, _) = max(dist.items(), key=lambda x: x[1])
    path = []
    while length > 0:
        path.append(node)
        length, node = dist[node]
    return list(reversed(path))

Copy-pastable definition of G.

import pandas as pd
import networkx as nx
import numpy as np

edge_df = pd.read_csv(
    pd.compat.StringIO(
        """edge1        edge2          weight
115252161:T  115252162:A     1.0
115252162:A  115252163:G     1.0
115252163:G  115252164:C     3.0
115252164:C  115252165:A     5.5
115252165:A  115252166:C     5.5
115252162:T  115252163:G     1.0
115252166:C  115252167:A     7.5
115252167:A  115252168:A     7.5
115252168:A  115252169:A     6.5
115252165:A  115252166:G     0.5"""
    ),
    sep=r" +",
)


G = nx.from_pandas_edgelist(
    edge_df,
    source="edge1",
    target="edge2",
    edge_attr=["weight"],
    create_using=nx.OrderedDiGraph(),
)

longest_path = pd.DataFrame(nx.dag_longest_path(G))

Upvotes: 2

Answers (3)

SummerEla

Reputation: 1952

I ended up just modeling the behavior in a defaultdict counter object.

from collections import defaultdict, Counter

I modified my edgelist to a tuple of (position, nucleotide, weight):

test = [(112,"A",23.0), (113, "T", 27), (112, "T", 12.0), (113, "A", 27), (112,"A", 1.0)]

Then used defaultdict(counter) to quickly sum occurences of each nucleotide at each position:

nucs = defaultdict(Counter)

for key, nuc, weight in test:
    nucs[key][nuc] += weight

And then looped through the dictionary to pull out all nucleotides that equal the max value:

for key, nuc in nucs.items():
    seq_list = []
    max_nuc = []
    max_val = max(nuc.values())
    for x, y in nuc.items():
        if y == max_val:
            max_nuc.append(x)

    if len(max_nuc) != 1:
        max_nuc = "N"
    else:
        max_nuc = ''.join(max_nuc)

    seq_list.append(max_nuc)
    sequence = ''.join(seq_list)

This returns the final sequence for the nucleotide with the max value found, and returns N in the position of a tie:

TNGCACAAATGCTGAAAGCTGTACCATANCTGTCTGGTCTTGGCTGAGGTTTCAATGAATGGAATCCCGTAACTCTTGGCCAGTTCGTGGGCTTGTTTTGTATCAACTGTCCTTGTTGGCAAATCACACTTGTTTCCCACTAGCACCAT

However, the question bothered me, so I ended up using node attributes in networkx as a means to flag each node as being a tie or not. Now, when a node is returned in the longest path, I can then check the "tie" attribute and replace the node name with "N" if it has been flagged:

def get_path(self, edge_df):

    G = nx.from_pandas_edgelist(
        edge_df,
        source="edge1",
        target="edge2",
        edge_attr=["weight"],
        create_using=nx.OrderedDiGraph()
    )

    # flag all nodes as not having a tie
    nx.set_node_attributes(G, "tie", False)

    # check nodes for ties
    for node in G.nodes:

        # create list of all edges for each node
        edges = G.in_edges(node, data=True)

        # if there are multiple edges
        if len(edges) > 1:

            # find max weight
            max_weight = max([edge[2]['weight'] for edge in edges])

            tie_check = []
            for edge in edges:
                # pull out all edges that match max weight
                if edge[2]["weight"] == max_weight:
                    tie_check.append(edge)

            # check if there are ties       
            if len(tie_check) > 1:
                for x in tie_check:

                    # flag node as being a tie
                    G.node[x[0]]["tie"] = True

    # return longest path
    longest_path = nx.dag_longest_path(G)     

    return longest_path

Upvotes: 1

hilberts_drinking_problem

Reputation: 11602

Judging by your example, each node is determined by position ID (number before :) and two nodes with different bases attached are identical for the purposes of computing the path length. If this is correct, there is no need to modify the algorithm and you could get your result by manipulating vertex labels.

Basically, drop everything after semicolon in the edge_df, compute the longest path and append the base labels from your original data.

edge_df_pos = pd.DataFrame(
    {
        "edge1": edge_df.edge1.str.partition(":")[0],
        "edge2": edge_df.edge2.str.partition(":")[0],
        "weight": edge_df.weight,
    }
)

vert_labels = dict()
for col in ("edge1", "edge2"):
    verts, lbls = edge_df[col].str.partition(":")[[0, 2]].values.T
    for vert, lbl in zip(verts, lbls):
        vert_labels.setdefault(vert, set()).add(lbl)


G_pos = nx.from_pandas_edgelist(
    edge_df_pos,
    source="edge1",
    target="edge2",
    edge_attr=["weight"],
    create_using=nx.OrderedDiGraph(),
)

longest_path_pos = nx.dag_longest_path(G_pos)

longest_path_df = pd.DataFrame([[node, vert_labels[node]] for node in longest_path_pos])

Result

# 0       1
# 0  115252161     {T}
# 1  115252162  {A, T}
# 2  115252163     {G}
# 3  115252164     {C}
# 4  115252165     {A}
# 5  115252166  {G, C}
# 6  115252167     {A}
# 7  115252168     {A}
# 8  115252169     {A}

If my interpretation is not correct, I doubt that there is a simple extension of the algorithm based on topological sort. The problem is that a graph can admit multiple topological sorts. If you print dist as defined in the dag_longest_path in your example, you'll get something like this:

{'115252161:T': (0, '115252161:T'),
 '115252162:A': (1, '115252161:T'),
 '115252162:T': (0, '115252162:T'),
 '115252163:G': (2, '115252162:A'),
 '115252164:C': (3, '115252163:G'),
 '115252165:A': (4, '115252164:C'),
 '115252166:C': (5, '115252165:A'),
 '115252166:G': (5, '115252165:A'),
 '115252167:A': (6, '115252166:C'),
 '115252168:A': (7, '115252167:A'),
 '115252169:A': (8, '115252168:A')}

Note that '115252162:T' occurs in the third line and nowhere else. Hence, dist cannot discriminate between your example and another example where '115252162:T' occurs as a disjoint component. So it should not be possible to recover any paths through '115252162:T' just using data in dist.

Upvotes: 1

Dimitrios Mavrommatis

Reputation: 321

This line inside the function seems to discard the paths you want; because max only returns one result:

node, (length, _) = max(dist.items(), key=lambda x: x[1])

I would keep the max value and then search based on it all the items. Then reuse the code to find the desired paths. An example would be like this:

def dag_longest_path(G):
    dist = {}  # stores [node, distance] pair
    for node in nx.topological_sort(G):
        # pairs of dist,node for all incoming edges
        pairs = [(dist[v][0] + 1, v) for v in G.pred[node]]
        if pairs:
            dist[node] = max(pairs)
        else:
            dist[node] = (0, node)
    # store max value inside val variable
    node, (length, val) = max(dist.items(), key=lambda x: x[1])
    # find all dictionary items that have the maximum value
    nodes = [(item[0], item[1][0]) for item in dist.items() if item[1][1] == val]
    paths = []
    # iterate over the different nodes and append the paths to a list
    for node, length in nodes:
        path = []
        while length > 0:
            path.append(node)
            length, node = dist[node]
        paths.append(list(reversed(path)))
    return paths

PS. I haven't tested this code to know if it runs correctly.

Upvotes: 1

NetworkX: Find longest path in DAG returning all ties for max

Answers (3)

Result

Related Questions