Reputation: 489

2-D Matrix: Finding and deleting columns that are subsets of other columns

I have a problem where I want to identify and remove columns in a logic matrix that are subsets of other columns. For example, [1, 0, 1] is a subset of [1, 1, 1], but neither [1, 1, 0] nor [0, 1, 1] is a subset of the other. I wrote a quick piece of code that identifies the subset columns; it does (n^2-n)/2 checks using a pair of nested for loops.

import numpy as np
A = np.array([[1, 0, 0, 0, 0, 1],
              [0, 1, 1, 1, 1, 0],
              [1, 0, 1, 0, 1, 1],
              [1, 1, 0, 1, 0, 1],
              [1, 1, 0, 1, 0, 0],
              [1, 0, 0, 0, 0, 0],
              [0, 0, 1, 1, 1, 0],
              [0, 0, 1, 0, 1, 0]])
rows,cols = A.shape
columns = [True]*cols
for i in range(cols):
    for j in range(i+1,cols):
        diff = A[:,i]-A[:,j]
        if all(diff >= 0):
            print("%d is a subset of %d" % (j, i))
            columns[j] = False
        elif all(diff <= 0):
            print("%d is a subset of %d" % (i, j))
            columns[i] = False
B = A[:,columns]

The solution should be

>>> print(B)
[[1 0 0]
 [0 1 1]
 [1 1 0]
 [1 0 1]
 [1 0 1]
 [1 0 0]
 [0 1 1]
 [0 1 0]]

For massive matrices, though, I'm sure there's a way to do this faster. One thought is to eliminate subset columns as I go, so that I'm not checking columns already known to be subsets. Another thought is to vectorize this so that I don't have O(n^2) operations. Thank you.

Upvotes: 6

Answers (3)

ToneDaBass

Reputation: 489

Since the A matrices I'm actually dealing with are 5000x5000 and sparse, with about 4% density, I decided to try a sparse-matrix approach combined with Python's "set" objects. Overall it's much faster than my original solution, but I feel like my process of going from matrix A to the list of sets D is not as fast as it could be. Any ideas on how to do this better are appreciated.

Solution

import numpy as np

A = np.array([[1, 0, 0, 0, 0, 1],
              [0, 1, 1, 1, 1, 0],
              [1, 0, 1, 0, 1, 1],
              [1, 1, 0, 1, 0, 1],
              [1, 1, 0, 1, 0, 0],
              [1, 0, 0, 0, 0, 0],
              [0, 0, 1, 1, 1, 0],
              [0, 0, 1, 0, 1, 0]])

rows,cols = A.shape
drops = np.zeros(cols).astype(bool)

# sparse nonzero elements
C = np.nonzero(A)

# create a list of sets containing the indices of non-zero elements of each column
D = [set() for j in range(cols)]
for i in range(len(C[0])):
    D[C[1][i]].add(C[0][i])

# find subsets, skipping columns that are already known to be subsets
for i in range(cols):
    if drops[i]:
        continue
    col1 = D[i]
    for j in range(i+1,cols):
        col2 = D[j]
        if col2.issubset(col1):
            # I tried `if drops[j]==True: continue` here, but that was slower
            print("%d is a subset of %d" % (j, i))
            drops[j] = True
        elif col1.issubset(col2):
            print("%d is a subset of %d" % (i, j))
            drops[i] = True
            # column i is dropped, so stop comparing it against later columns
            break

B = A[:, ~drops]
print(B)
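
On the step from matrix A to the list of sets D: the following is a minimal sketch of one possible alternative, not benchmarked here, that lets NumPy group the row indices per column before any Python set is built. The helper name column_sets is hypothetical. It relies on np.nonzero returning indices in row-major order, so calling it on A.T yields indices that are already sorted by column of A.

```python
import numpy as np

def column_sets(A):
    """Return a list holding, for each column of A, the set of its nonzero row indices."""
    # nonzero on the transpose gives (column, row) index pairs sorted by column
    col_idx, row_idx = np.nonzero(A.T)
    # number of nonzeros per column, including columns that are all zero
    counts = np.bincount(col_idx, minlength=A.shape[1])
    # split the row indices at the column boundaries
    boundaries = np.cumsum(counts)[:-1]
    return [set(chunk.tolist()) for chunk in np.split(row_idx, boundaries)]

A = np.array([[1, 0, 0, 0, 0, 1],
              [0, 1, 1, 1, 1, 0],
              [1, 0, 1, 0, 1, 1],
              [1, 1, 0, 1, 0, 1],
              [1, 1, 0, 1, 0, 0],
              [1, 0, 0, 0, 0, 0],
              [0, 0, 1, 1, 1, 0],
              [0, 0, 1, 0, 1, 0]])

D = column_sets(A)
print(D[1])
```

The resulting D can be used in place of the list built by the explicit loop over the nonzero entries. Whether it is actually faster on a 5000x5000 matrix would need to be measured.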

Upvotes: 1

piRSquared

Reputation: 294498

Define subset for 0/1 columns: col1 is a subset of col2 if and only if col1.dot(col1) == col1.dot(col2).

Define col1 and col2 as equivalent if and only if col1 is a subset of col2 and vice versa.

I split the work into two steps. First, get rid of all but one of each group of equivalent columns. Then remove the subsets.
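
As a small illustration of the dot-product definition, here is a sketch using the example vectors from the question. The helper name is_subset is hypothetical, and the test assumes the columns contain only 0s and 1s: col1.dot(col1) then counts the ones in col1, and col1.dot(col2) counts the positions where both columns are one.

```python
import numpy as np

def is_subset(col1, col2):
    # Equal counts mean every one in col1 is matched by a one in col2.
    return col1.dot(col1) == col1.dot(col2)

a = np.array([1, 0, 1])
b = np.array([1, 1, 1])
c = np.array([1, 1, 0])
d = np.array([0, 1, 1])

print(is_subset(a, b))  # True: a is a subset of b
print(is_subset(b, a))  # False: b has a one where a has a zero
print(is_subset(c, d))  # False
print(is_subset(d, c))  # False
```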

Solution

import numpy as np

def drop_duplicates(A):
    N = A.T.dot(A)
    D = np.diag(N)[:, None]
    drops = np.tril((N == D) & (N == D.T), -1).any(axis=1)
    return A[:, ~drops], drops

def drop_subsets(A):
    N = A.T.dot(A)
    drops = ((N == np.diag(N)).sum(axis=0) > 1)
    return A[:, ~drops], drops

def drop_strict(A):
    A1, d1 = drop_duplicates(A)
    A2, d2 = drop_subsets(A1)
    d1[~d1] = d2
    return A2, d1


A = np.array([[1, 0, 0, 0, 0, 1],
              [0, 1, 1, 1, 1, 0],
              [1, 0, 1, 0, 1, 1],
              [1, 1, 0, 1, 0, 1],
              [1, 1, 0, 1, 0, 0],
              [1, 0, 0, 0, 0, 0],
              [0, 0, 1, 1, 1, 0],
              [0, 0, 1, 0, 1, 0]])    

B, drops = drop_strict(A)

Demonstration

print(B)
print("")
print(drops)

[[1 0 0]
 [0 1 1]
 [1 1 0]
 [1 0 1]
 [1 0 1]
 [1 0 0]
 [0 1 1]
 [0 1 0]]

[False  True False False  True  True]

Explanation

N = A.T.dot(A) is the matrix of dot products between every pair of columns: N[i, j] is A[:, i].dot(A[:, j]), and the diagonal N[i, i] is A[:, i].dot(A[:, i]). Per the definition of subset at the top, this will come in handy.
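
To make the role of N concrete, here is a minimal sketch on a hypothetical 3-column 0/1 matrix M (not taken from the question). The names M and subset are introduced only for illustration.

```python
import numpy as np

# Columns of M: c0 = [1, 0, 1], c1 = [1, 1, 1], c2 = [0, 1, 0]
M = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 1, 0]])

N = M.T.dot(M)
# N[i, i] counts the ones in column i; N[i, j] counts the rows
# where columns i and j are both one.
print(N)

# subset[i, j] is True when column i is a subset of column j,
# i.e. when N[i, j] equals the diagonal entry N[i, i].
subset = N == np.diag(N)[:, None]
print(subset)
```

In this example both c0 and c2 come out as subsets of c1, while c1 is a subset only of itself.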

def drop_duplicates(A):
    N = A.T.dot(A)
    D = np.diag(N)[:, None]
    # (N == D)[i, j] being True identifies A[:, i] as a subset
    # of A[:, j]; (N == D.T)[i, j] being True identifies A[:, j]
    # as a subset of A[:, i].  If both hold, then we have
    # equivalent columns.  Taking the strict lower triangle (i > j)
    # flags only the later copies, which ensures we leave one.
    drops = np.tril((N == D) & (N == D.T), -1).any(axis=1)
    return A[:, ~drops], drops

def drop_subsets(A):
    N = A.T.dot(A)
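    # assuming equivalent columns have already been removed,
    # (N == np.diag(N))[i, j] is True when A[:, j] is a subset of
    # A[:, i].  The diagonal is always True, so a column sum > 1
    # means A[:, j] is a subset of some other column.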
    drops = ((N == np.diag(N)).sum(axis=0) > 1)
    return A[:, ~drops], drops

Upvotes: 0

Divakar

Reputation: 221634

Here's another approach using NumPy broadcasting -

A[:,~((np.triu(((A[:,:,None] - A[:,None,:])>=0).all(0),1)).any(0))]

A detailed commented explanation is listed below -

# Perform elementwise subtractions keeping the alignment along the columns,
# so that sub[r, i, j] = A[r, i] - A[r, j]
sub = A[:,:,None] - A[:,None,:]

# Look for >=0 subtractions: column j can only be a subset of column i
# if every difference is non-negative
mask3D = sub>=0

# Check if all elements along each column satisfy that criterion, giving us a 2D
# mask where mask2D[i, j] is True when column j is a subset of column i
mask2D = mask3D.all(0)

# Finally get the valid column mask by dropping every column of the 2D mask
# that has at least one True element above the diagonal (sans the diagonal
# elements themselves). Note that np.triu(mask2D,1) keeps only pairs with
# i < j, so a column is dropped only when it is a subset of an earlier column.
# Index into input array with it for the final output.
colmask = ~(np.triu(mask2D,1).any(0))
out = A[:,colmask]
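
Because the np.triu step above only compares each column with the columns before it, column 1 in the question's example (a subset of the later column 3) would survive. The following is a minimal sketch of an order-independent variant of the same broadcasting idea. The function name drop_subset_columns and the intermediate names contains, strict and dup are hypothetical, and the intermediate boolean array has shape rows x cols x cols, so this is only practical for modest sizes.

```python
import numpy as np

def drop_subset_columns(A):
    # contains[i, j] is True when column j is a subset of column i
    contains = (A[:, :, None] >= A[:, None, :]).all(axis=0)
    # strict[i, j]: column j is a proper subset of column i
    strict = contains & ~contains.T
    # dup[i, j]: columns i and j are identical and i comes first,
    # so only the later copy j is flagged
    dup = np.triu(contains & contains.T, 1)
    keep = ~(strict | dup).any(axis=0)
    return A[:, keep], keep

A = np.array([[1, 0, 0, 0, 0, 1],
              [0, 1, 1, 1, 1, 0],
              [1, 0, 1, 0, 1, 1],
              [1, 1, 0, 1, 0, 1],
              [1, 1, 0, 1, 0, 0],
              [1, 0, 0, 0, 0, 0],
              [0, 0, 1, 1, 1, 0],
              [0, 0, 1, 0, 1, 0]])

B, keep = drop_subset_columns(A)
print(keep)
```

On the question's matrix this keeps columns 0, 2 and 3, which matches the expected result B given in the question.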

Upvotes: 0
