marshall
marshall

Reputation: 2483

Find two pairs of pairs that sum to the same value

I have random 2d arrays which I make using

import numpy as np
from itertools import combinations
n = 50
A = np.random.randint(2, size=(n,n))

I would like to determine if the matrix has two pairs of pairs of rows which sum to the same row vector. I am looking for a fast method to do this. My current method just tries all possibilities.

for pair in  combinations(combinations(range(n), 2), 2):
    if (np.array_equal(A[pair[0][0]] + A[pair[0][1]], A[pair[1][0]] + A[pair[1][1]] )):
        print "Pair found", pair

A method that worked for n = 100 would be really great.

Upvotes: 10

Views: 763

Answers (5)

richsilv
richsilv

Reputation: 8013

Based on the code in your question, and on the assumption that you're actually looking for pairs of pairs of rows that sum to equal the same row vector, you could do something like this:

def findMatchSets(A):
   B = A.transpose()
   pairs = tuple(combinations(range(len(A[0])), 2))
   matchSets = [[i for i in pairs if B[0][i[0]] + B[0][i[1]] == z] for z in range(3)]
   for c in range(1, len(A[0])):
      matchSets = [[i for i in block if B[c][i[0]] + B[c][i[1]] == z] for z in range(3) for block in matchSets]
      matchSets = [block for block in matchSets if len(block) > 1]
      if not matchSets:
         return []
   return matchSets

This basically stratifies the matrix into equivalence sets that sum to the same value after one column has been taken into account, then two columns, then three, and so on, until it either reaches the last column or there is no equivalence set left with more than one member (i.e. there is no such pair of pairs). This will work fine for 100x100 arrays, largely because the chances of two pairs of rows summing to the same row vector are infinitesimally small when n is large (n*(n+1)/2 combinations compared to 3^n possible vector sums).

UPDATE

Updated code to allow searching for pairs of n-size subsets of all rows, as requested. Default is n=2 as per the original question:

def findMatchSets(A, n=2):
   B = A.transpose()
   pairs = tuple(combinations(range(len(A[0])), n))
   matchSets = [[i for i in pairs if sum([B[0][i[j]] for j in range(n)]) == z] for z in range(n + 1)]
   for c in range(1, len(A[0])):
      matchSets = [[i for i in block if sum([B[c][i[j]] for j in range(n)]) == z] for z in range(n + 1) for block in matchSets]
      matchSets = [block for block in matchSets if len(block) > 1]
      if not matchSets:
      return []
   return matchSets

Upvotes: 4

Eelco Hoogendoorn
Eelco Hoogendoorn

Reputation: 10759

Here is a 'lazy' approach, that scales up to n=10000, using 'only' 4gb of memory, and completing in 10s per call or so. Worst case complexity is O(n^3), but for random data, expected performance is O(n^2). At first sight, it seems like youd need O(n^3) ops; each row combination needs to be produced and inspected at least once. But we need not look at the entire row. Rather, we can perform an early exit strategy on the comparison of rowpairs, once it is clear they are of no use to us; and for random data, we may draw this conclusion typically long before we have considered all columns in a row.

import numpy as np

n = 10
#also works for non-square A
A = np.random.randint(2, size=(n*2,n)).astype(np.int8)
#force the inclusion of some hits, to keep our algorithm on its toes
##A[0] = A[1]


def base_pack_lazy(a, base, dtype=np.uint64):
    """
    pack the last axis of an array as minimal base representation
    lazily yields packed columns of the original matrix
    """
    a = np.ascontiguousarray( np.rollaxis(a, -1))
    init = np.zeros(a.shape[1:], dtype)
    packing = int(np.dtype(dtype).itemsize * 8 / (float(base) / 2))
    for columns in np.array_split(a, (len(a)-1)//packing+1):
        yield reduce(
            lambda acc,inc: acc*base+inc,
            columns,
            init)

def unique_count(a):
    """returns counts of unique elements"""
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), np.int)
    np.add.at(count, inverse, 1)        #note; this scatter operation requires numpy 1.8; use a sparse matrix otherwise!
    return unique, count, inverse

def has_identical_row_sums_lazy(A, combinations_index):
    """
    compute the existence of combinations of rows summing to the same vector,
    given an nxm matrix A and an index matrix specifying all combinations

    naively, we need to compute the sum of each row combination at least once, giving n^3 computations
    however, this isnt strictly required; we can lazily consider the columns, giving an early exit opportunity
    all nicely vectorized of course
    """

    multiplicity, combinations = combinations_index.shape
    #list of indices into combinations_index, denoting possibly interacting combinations
    active_combinations = np.arange(combinations, dtype=np.uint32)

    for packed_column in base_pack_lazy(A, base=multiplicity+1):       #loop over packed cols
        #compute rowsums only for a fixed number of columns at a time.
        #this is O(n^2) rather than O(n^3), and after considering the first column,
        #we can typically already exclude almost all rowpairs
        partial_rowsums = sum(packed_column[I[active_combinations]] for I in combinations_index)
        #find duplicates in this column
        unique, count, inverse = unique_count(partial_rowsums)
        #prune those pairs which we can exclude as having different sums, based on columns inspected thus far
        active_combinations = active_combinations[count[inverse] > 1]
        #early exit; no pairs
        if len(active_combinations)==0:
            return False
    return True

def has_identical_triple_row_sums(A):
    n = len(A)
    idx = np.array( [(i,j,k)
        for i in xrange(n)
            for j in xrange(n)
                for k in xrange(n)
                    if i<j and j<k], dtype=np.uint16)
    idx = np.ascontiguousarray( idx.T)
    return has_identical_row_sums_lazy(A, idx)

def has_identical_double_row_sums(A):
    n = len(A)
    idx = np.array(np.tril_indices(n,-1), dtype=np.int32)
    return has_identical_row_sums_lazy(A, idx)


from time import clock
t = clock()
for i in xrange(10):
    print has_identical_double_row_sums(A)
    print has_identical_triple_row_sums(A)
print clock()-t

Extended to include the calculation over sums of triplets of rows, as you asked above. For n=100, this still takes only about 0.2s

Edit: some cleanup; edit2: some more cleanup

Upvotes: 1

Eelco Hoogendoorn
Eelco Hoogendoorn

Reputation: 10759

Here is a pure numpy solution; no extensive timings, but I have to push n up to 500 before I can see my cursor blink once before it completes. it is memory intensive though, and will fail due to memory requirements for much larger n. Either way, I get the intuition that the odds of finding such a vector decrease geometrically for larger n anyway.

import numpy as np

n = 100
A = np.random.randint(2, size=(n,n)).astype(np.int8)

def base3(a):
    """
    pack the last axis of an array in base 3
    40 base 3 numbers per uint64
    """
    S = np.array_split(a, a.shape[-1]//40+1, axis=-1)
    R = np.zeros(shape=a.shape[:-1]+(len(S),), dtype = np.uint64)
    for i in xrange(len(S)):
        s = S[i]
        r = R[...,i]
        for j in xrange(s.shape[-1]):
            r *= 3
            r += s[...,j]
    return R

def unique_count(a):
    """returns counts of unique elements"""
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), np.int)
    np.add.at(count, inverse, 1)
    return unique, count

def voidview(arr):
    """view the last axis of an array as a void object. can be used as a faster form of lexsort"""
    return np.ascontiguousarray(arr).view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1]))).reshape(arr.shape[:-1])

def has_pairs_of_pairs(A):
    #optional; convert rows to base 3
    A = base3(A)
    #precompute sums over a lower triangular set of all combinations
    rowsums = sum(A[I] for I in np.tril_indices(n,-1))
    #count the number of times each row occurs by sorting
    #note that this is not quite O(n log n), since the cost of handling each row is also a function of n
    unique, count = unique_count(voidview(rowsums))
    #print if any pairs of pairs exist;
    #computing their indices is left as an excercise for the reader
    return np.any(count>1)

from time import clock
t = clock()
for i in xrange(100):
    print has_pairs_of_pairs(A)
print clock()-t

Edit: included base-3 packing; now n=2000 is feasible, taking about 2gb of mem, and a few seconds of processing

Edit: added some timings; n=100 takes only 5ms per call on my i7m.

Upvotes: 4

Eelco Hoogendoorn
Eelco Hoogendoorn

Reputation: 10759

Your current code does not test for pairs of rows that sum to the same value.

Assuming that's actually what you want, its best to stick to pure numpy. This generates the indices of all rows that have equal sum.

import numpy as np

n = 100
A = np.random.randint(2, size=(n,n))

rowsum = A.sum(axis=1)

unique, inverse = np.unique(rowsum, return_inverse = True)

count = np.zeros_like(unique)
np.add.at(count, inverse, 1)

for p in unique[count>1]:
    print p, np.nonzero(rowsum==p)[0]

Upvotes: 1

user545424
user545424

Reputation: 16179

If all you need to do is determine whether such a pair exists you can do:

exists_unique = np.unique(A.sum(axis=1)).size != A.shape[0]

Upvotes: 0

Related Questions