yathrakaaran
yathrakaaran

Reputation: 189

How to find the connected component containing max number of given set of nodes

I have a graph containing many genes and their interactions. I am interested in finding a subgraph with the maximum no .of a specific set of genes say, A, B, C, D, E in the graph.

Tried BFS algorithm and also connected components. But do not know how to find the subgraph of my genes of interest.

def bfs(G, gene, n):
"""

Using breadth-first search
returns a graph of breadth n starting at source gene.
"""
S = nx.algorithms.traversal.breadth_first_search.bfs_tree(G, source=gene, depth_limit=n)
return S

Given a graph G(V,E) with V vertices and E edges , I want to find a subgraph G'(v,e) where v is a subset of V, such that G' contains maximum of my nodes on interest.

Upvotes: 2

Views: 1989

Answers (1)

Joel
Joel

Reputation: 23827

edit While I think my original code (now at the bottom) was good, I think you can do better using node_connected_component(G,u) which returns the set of nodes in the same component as u.

Let me explain the code below. First, I'm going to step through the interesting genes. With the first one, I look for the component of G in which it sits, and then find all of the other interesting genes that are in the same component. Then when I look at each subsequent interesting gene, I make sure I haven't already encountered it in the same component as another. If it's a new component, I find all of the other interesting genes in the same component. At the end, I find the component that had the most interesting genes.

component_count = {}
seen = {}
for source in interesting_genes:
    if source not in seen:
        reachable_nodes = nx.node_connected_component(G, source)
        reachable_nodes_of_interest = [target for target in interesting_genes if target in reachable_nodes]
        for node in reachable_nodes_of_interest:
            seen[node] = True
        component_count = len(reachable_nodes_of_interest)

source = max(component_count, key=component_count.get) #finds the node with the largest component_count


Gprime = G.subgraph(nx.node_connected_component(G, source))

Take a look at the documentation for single_source_shortest_path_length.

seen = {source:False for source in interesting_genes}  #if we've found the component of a node, no need to recalculate it.
component_count = {}  #will count how many other interesting genes there are in a component
for source in interesting_genes:
    if not seen[source]:
        reachable_nodes_dict = nx.single_source_shortest_path_length(G, node)
        reachable_nodes_of_interest = [target for target in interesting_genes if target in reachable_nodes_dict]
        for target in reachable_nodes_of_interest:
            seen[target] = True
        component_count[source] = len(reachable_nodes_of_interest)
source = max(component_count, key=component_count.get) #finds the node with the largest component_count


Gprime = G.subgraph(nx.node_connected_component(G, source))

Upvotes: 3

Related Questions