How to find the min set of vertices in multiple potentially independent UNdirected Graphs that captures the largest total cost

Question

I have a large set of vertices/nodes that represent a set of graphs. Note that there could be many independent graphs within this complete set. The goal is to find the min number of vertices across all these graphs that correspond to the largest total sum of weights across all the edges captured by those selected vertices. I have the adjacency matrix in pandas and I am using networkx.

Below is a sample dataframe with the three columns where Number_Of_Trips is the weight. I could provide a weight of node = 10*trips in order to merge the two metrics together. I.e. maximizing # of Trips - 10*NumberOfNodes

    Number_Of_Trips dropoff_gh7 pickup_gh7
0   304 9tbqhsx 9tbqj4g
1   271 9tbqj4f 9tbqhsx
2   263 9tbqt4s 9tbqhsx
3   258 9tbqdye 9tbqdsr
4   256 9tbqhgh 9tbqjfv
5   236 9tbqhsw 9tbqj4g
6   233 9tbqt4g 9tbqv03
7   229 9tbqhsx 9tbqj4c
8   218 9tbqy3f 9tbqt4s
9   213 9tbq5v4 9tbqh41
10  210 9tbqhgh 9tbqhsw
11  192 9tbqhgh 9tbqje4
12  186 9tbqy3f 9tbqt4g
13  184 9tbqhgh 9tbqj4z
14  183 9tbqe3d 9tbqe9e
15  170 9tbq3xn 9tbq39w
16  167 9tbq5bw 9tbqht6
17  163 9tbqhsx 9tbqh0x
18  162 9tbqdk1 9tbq7p2
19  160 9tbqsch 9tbqt4s

x = nx.from_pandas_dataframe(df,"dropoff_gh7","pickup_gh7","Number_Of_Trips")
graphs = list(nx.connected_component_subgraphs(x))

SriK · Accepted Answer

Note that one caveat to the question is that you could have multiple independent subgraphs within the graph that could be the solution. The key intuition to this solution is that the most likely candidates for subgraphs are the vertices that share a lot of edges with each other. It turns out that this is precisely what is evaluated when looking at Cliques in a graph. Hence, this solution simply extracts all the cliques and then rank orders them by the total num of weights represented by the vertices in the clique - the number of vertices * cost of a vertice. This can be quickly prototyped using NetworkX.

G = nx.from_pandas_dataframe(df, "dropoff_gh7", "pickup_gh7", ['num_of_trips'])
# Find all the cliques in the graph (not only maximal but all sub cliques as well. Note that clique finding is NP complete so this may take a long time if your graph is > 100k of edges or more. For <100k edges, this took within 5 mins on a 16GB macbook pro 3GHz machine.
cliques = nx.find_cliques(G)
clique_trips = [np.array([c,G.subgraph(c).size(weight="num_of_trips")]) for c in cliques]
df_cliques = pd.DataFrame(clique_trips,columns=["vertices","num_of_trips"])
df_cliques["num_vertices"] = df_cliques.apply(lambda x:len(x[0]), axis=1)
df_cliques["weighted_trips"] = df_cliques.apply(lambda row: 
    row["num_of_trips"] - row["num_vertices"]*COST_PER_NODE, axis=1)
df_cliques = df_cliques.sort_values("weighted_trips")[::-1]
df_cliques.head()
# The top N cliques can then be aggregated into a set to identify the precise vertices that are most valuable.

How to find the min set of vertices in multiple potentially independent UNdirected Graphs that captures the largest total cost

Answers (2)

Related Questions