Michael
Michael

Reputation: 1767

Point-cloud cluster-analysis in Python - identifying clusters from binary matrix

As an example, I have the following input data (the point-cloud I am working with is more complex)

Data = [1,1,1,1,1],[1,1,2,1,1],[2,2,2,1,1],[3,3,3,1,1],[4,4,4,1,1],[5,5,5,1,1],[50,50,50,1,1],[95,95,95,1,1],[96,96,96,1,1],[97,97,97,1,1],[98,98,98,1,1],[99,99,99,1,1],[2,2,3,1,1],[2,2,1,1,1],[2,2,4,1,1]

A clustering algorithm gives a binary upper triangular matrix (lets call it connection matrix). A 1 means that two points are connected. E.g. Point ID 0 (row 0) is connected to itself (column 0), and 1,2,3,12,13,14. But Points 4 and 5 are also reached via 3, 12, 13, and 14.

[ 1.  1.  1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.]
[ 0.  1.  1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.]
[ 0.  0.  1.  1.  1.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.]
[ 0.  0.  0.  1.  1.  1.  0.  0.  0.  0.  0.  0.  1.  1.  1.]
[ 0.  0.  0.  0.  1.  1.  0.  0.  0.  0.  0.  0.  1.  0.  1.]
[ 0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  1.  1.  1.  0.  0.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.  0.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.]

I can identify the clusters per row with rowclustering(s), where s is the binary matrix from above.

def rowclustering(s):
    r = 0
    idx = []
    while r < size(s,0):
        row = []
        for i in range(size(s,1)):
            if s[r][i] == 1:
                row = row + [i]
        r = r + 1
        idx = idx + [row]
    return idx

And the returned idx is:

idx = [[0, 1, 2, 3, 12, 13, 14], [1, 2, 3, 12, 13, 14], [2, 3, 4, 12, 13, 14], [3, 4, 5, 12, 13, 14], [4, 5, 12, 14], [5], [6], [7, 8, 9], [8, 9, 10], [9, 10, 11], [10, 11], [11], [12, 13, 14], [13, 14], [14]]

Now, obviously there are fewer clusters than 15, because some of the rows are connected via a common ID (e.g. look at ID 4 and 5). What I want to have is:

result = [[0, 1, 2, 3, 4, 5, 12, 13, 14], [6], [7, 8, 9, 10, 11]]

I have tried to create a function (clustering(idx,f) ) that does that, where idx, is the result of rowclustering(s) and f would be the first row in idx, e.g. [0, 1, 2, 3, 12, 13, 14]. However, this function will not finish properly. What would be the proper code to break the function after all connections (of idx IDs) are made?

def clustering(idx,f):
    for i in f:
        f = f + idx[i]
    f = list(set(f))
    clustering(idx,f)

    return

The problem that I am trying to solve is a, sort of, self growing procedure. The function clustering should call itself until all possible point connections are made. This could be done on the idx or (maybe better) on the connection matrix (matrix reduction?).

Any help is much appreciated! Let me know if I should clarify my question. Thanks.

Upvotes: 1

Views: 1424

Answers (1)

Erotemic
Erotemic

Reputation: 5228

Your problem can be viewed as finding connected components. You can use networkx to get a solution, or you could implement BFS (breadth first search) yourself.

import networkx as nx
import numpy as np
x = """
[ 1.  1.  1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.]
[ 0.  1.  1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.]
[ 0.  0.  1.  1.  1.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.]
[ 0.  0.  0.  1.  1.  1.  0.  0.  0.  0.  0.  0.  1.  1.  1.]
[ 0.  0.  0.  0.  1.  1.  0.  0.  0.  0.  0.  0.  1.  0.  1.]
[ 0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  1.  1.  1.  0.  0.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.  0.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.]
"""
mat = eval('[' + x.replace('.', '.,').replace(']', '],') + ']')
mat = np.array(mat)
G = nx.from_numpy_matrix(np.array(mat))
print(list(nx.connected_components(G)))
[{0, 1, 2, 3, 4, 5, 12, 13, 14}, {6}, {7, 8, 9, 10, 11}]

EDIT:

Actually, something about this problem made me remember something I read ahile ago. This can actually be computed using only matrix operations. Its very nifty. Your initial matrix is the adjacency matrix (A), and we need to also specify a degree matrix (D) that holds the degree of each node on the diagonal. We can use these to define the Laplacian matrix (L) and then use a bit of spectral graph theory. (yay!)

# Make the undirected version of the graph (no self loops)
A = (mat + mat.T) * (1 - np.eye(mat.shape[0]))
# Make the degree matrix
D = np.diag(A.sum(axis=1) + A.sum(axis=0)) / 2
# thats all we need to define the laplacian
L = D - A

# The number of zeros eigenvalues of the Laplacian is exactly the number of CCs
np.isclose(np.linalg.eigvals(L), 0).sum()

3 

# The connected compoments themselves are identified by rows that have the same nullspace vector
u, s, vh = np.linalg.svd(L)
ns = vh[(s >= 1e-13).sum():].conj().T

array([[-0.32684842, -0.06239247, -0.0197079 ],
   [-0.32684842, -0.06239247, -0.0197079 ],
   [-0.32684842, -0.06239247, -0.0197079 ],
   [-0.32684842, -0.06239247, -0.0197079 ],
   [-0.32684842, -0.06239247, -0.0197079 ],
   [-0.32684842, -0.06239247, -0.0197079 ],
   [-0.19222441,  0.97663659,  0.09607676],
   [-0.01778075, -0.04721352,  0.44435878],
   [-0.01778075, -0.04721352,  0.44435878],
   [-0.01778075, -0.04721352,  0.44435878],
   [-0.01778075, -0.04721352,  0.44435878],
   [-0.01778075, -0.04721352,  0.44435878],
   [-0.32684842, -0.06239247, -0.0197079 ],
   [-0.32684842, -0.06239247, -0.0197079 ],
   [-0.32684842, -0.06239247, -0.0197079 ]])

Now, we've computed the answer! Its just a little strange to interpret. A little bit of processing can transform this into the representation you want.

# the following is a little numpy trick to find unique rows 
# chopping off the last few decimal places to account for numerical errors
ns_ = np.ascontiguousarray(np.round(ns, 8)).view(np.dtype((np.void, ns.dtype.itemsize * ns.shape[1])))
ns_basis, row_to_cc_id = np.unique(ns_, return_inverse=True)
# Finally we can just use this to convert to the desired output format
groups = [[] for _ in range(len(ns_basis))]
for row, id in enumerate(row_to_cc_id):
    groups[id].append(row)
print(groups)
[[0, 1, 2, 3, 4, 5, 12, 13, 14], [6], [7, 8, 9, 10, 11]]

Upvotes: 3

Related Questions