Reputation: 681
Considering a graph of nodes from networkx how can I apply a kmean cluster of all the nodes where specific nodes are considered the centroids of the clusters. In other words, assume we have this graph:
import networkx as nx
s = [0,3,2,3,4,5,1]
t = [1,2,7,4,6,6,5]
dist = [3,2,5,1,5,4,2]
G = nx.Graph()
for i in range(len(s)):
G.add_edge(s[i],t[i],weight=dist[i])
I want to apply a kmean clustering on the network where for example I choose the centroids to be 3 and 6 and the graph will be clustered accordingly to produce two subgraphs (or as many centroids as I input)
I have been looking at the kmean clustering here https://www.learndatasci.com/tutorials/k-means-clustering-algorithms-python-intro/ and what that doesn't cover is the inputted centroids but rather it only considers number of clusters without a centroid node.
Upvotes: 4
Views: 1663
Reputation: 10987
Note that you cannot directly apply k-means clustering to a network, as there does not necessarily exist a metric to measure distances between nodes and centroids. But...
.. provided you assume:
Under these assumptions the sum of distances to the centroids is minimal if you associate to each node the centroid with the shortest weighted shortest-path.
So the procedure could be:
This procedure corresponds loosely to the procedure of k-mean clustering, that is to minimize the within-cluster sum of squares (WCSS).
Although this procedure is similar to k-means clustering in data points in a metric-space, I would not call it k-means clustering. Especially because the position of the centroids is restricted to nodes in the network.
Here is how you could approach this with python:
Define the initial centroids:
centroids = [3, 6]
For each node, get all shortest paths to all centroids .
For example:
shortest_paths = [[(
cent,
nx.shortest_path(G,
source=n,
target=cent,
weight='weight')
) for cent in centroids] for n in G.nodes]
This gives (here they are reported together with the id of the centroid):
In [26]: shortest_paths
Out[26]:
[[(3, [0, 1, 5, 6, 4, 3]), (6, [0, 1, 5, 6])],
[(3, [1, 5, 6, 4, 3]), (6, [1, 5, 6])],
[(3, [3]), (6, [3, 4, 6])],
[(3, [2, 3]), (6, [2, 3, 4, 6])],
[(3, [7, 2, 3]), (6, [7, 2, 3, 4, 6])],
[(3, [4, 3]), (6, [4, 6])],
[(3, [6, 4, 3]), (6, [6])],
[(3, [5, 6, 4, 3]), (6, [5, 6])]]
Calculate the actual distance, i.e. sum the weights over the paths, for all shortest paths for all nodes:
For example:
distances = [[(
sp[0], # this is the id of the centroid
sum(
[G[sp[1][i]][sp[1][i+1]]['weight']
for i in range(len(sp[1]) - 1)]
) if len(sp[1]) > 1 else 0
) for sp in sps] for sps in shortest_paths]
So the distances are:
In [28]: distances
Out[28]:
[[(3, 15), (6, 9)],
[(3, 12), (6, 6)],
[(3, 0), (6, 6)],
[(3, 2), (6, 8)],
[(3, 7), (6, 13)],
[(3, 1), (6, 5)],
[(3, 6), (6, 0)],
[(3, 10), (6, 4)]]
Get the centroid with the minimal distance for all nodes:
For example:
closest_centroid = [
min(dist, key=lambda d: d[1])[0] for dist in distances
]
Leading to the grouping according to the centroids:
In [30]: closest_centroid
Out[30]: [6, 6, 3, 3, 3, 3, 6, 6]
Update the centroids as the current centroids might no longer be the actual centroids of the group:
Approach:
# for each group
# for each member of the group
# get the distance of shortest paths to all the other members of the group
# sum this distances
# find the node with the minimal summed distance > this is the new centroid of the group
Iteration: If the new centroids are not the same as the old one, use the new centroids and repeat steps 2.- 5.
Final step: If the new centroids found in step 5. are the same as the old ones or you've reached a iteration limit, associate the closest centroid to each node:
For example:
nodes = [n for n in G] # the actual id of the nodes
cent_dict = {nodes[i]: closest_centroid[i] for i in range(len(nodes))}
nx.set_node_attributes(G, cent_dict, 'centroid')
Or nx.set_node_attributes(G, 'centroid', cent_dict)
, if you are still at v1.x.
This would be an approach to do a sort-of k-means clustering for a network.
Upvotes: 6