Reputation: 8467
I'm looking for an efficient algorithm to identify a block structure in a matrix with many 0 entries.
For example, the 6×7 matrix
0.0975 0.9575 0 0 0 0 0
0.2785 0.9649 0 0 0 0 0
0.5469 0.1576 0 0 0 0 0
0 0 0.9706 0.9572 0 0 0
0 0 0 0 0.8235 0.3171 0.0344
0 0 0 0 0.6948 0.9502 0.4387
consists of three blocks of sizes 3×2, 1×2, and 2×3, respectively.
A block is defined by a set of rows and a set of columns. A block structure is characterized by the fact that all entries that do not belong to a block are 0 exactly. However, there may be exact-0 entries also within the blocks.
A trivial solution is to always declare the whole matrix a block; therefore, a solution is sought such that the number of within-block entries is as small as possible.
To make things harder (or maybe easier?), the blocks do not have to be contiguous. A permuted version of the above matrix,
0 0.9572 0 0 0 0 0.9706
0 0 0.0975 0 0 0.9575 0
0.4387 0 0 0.9502 0.6948 0 0
0.0344 0 0 0.3171 0.8235 0 0
0 0 0.2785 0 0 0.9649 0
0 0 0.5469 0 0 0.1576 0
therefore also has a three-block structure, which can be described as:
Solutions I have thought of are:
Use a connection-weight-based cluster algorithm. However, the matrix does not have to be symmetric or even square. There is no correspondence between a specific row and a specific column.
Initially define a block to consist of one (non-0) entry (described by its row and its column), look for non-0 entries in its row and in its column and add the respective columns and rows, grow like that iteratively until no rows or columns are added; that identifies one block. Do the same starting from an entry that is not contained in the block. Repeat until no non-0 entries are left. Here I doubt that this algorithm efficiently scales to a large matrix with many blocks.
I'm looking for an algorithm, or other ideas for an algorithm, not for an implementation. However, an implementation e.g. in Matlab or Python would be welcome.
Upvotes: 2
Views: 323
Reputation: 77454
This is a standard scenario in general expression analysis.
The algorithms for this are known as biclustering (because they cluster rows and columns at the same time). An early method is due to Cheng and Church.
Upvotes: 2