acdr

Reputation: 4726

Check if two numpy arrays are identical

Suppose I have a bunch of arrays, including x and y, and I want to check if they're equal. Generally, I can just use np.all(x == y) (barring some dumb corner cases which I'm ignoring now).

However, this evaluates the entire array (x == y), which is usually not needed. My arrays are really large, and I have a lot of them, and the probability of two arrays being equal is small, so in all likelihood I really only need to evaluate a very small portion of (x == y) before all could return False, so this is not an optimal solution for me.

I've tried using the builtin all function, in combination with itertools.izip: all(val1==val2 for val1,val2 in itertools.izip(x, y))

However, that is much slower when the two arrays are equal, so much so that overall it's still not worth using over np.all. I presume that's because of the builtin all's general-purposeness. And np.all doesn't work on generators.

Is there a way to do what I want in a more speedy manner?

I know this question is similar to previously asked questions (e.g. Comparing two numpy arrays for equality, element-wise), but they specifically don't cover the case of early termination.

Upvotes: 13

Views: 16622

Answers (8)

Alexander

Reputation: 173

Breaking down the original problem into three parts: "(1) My arrays are really large, and (2) I have a lot of them, and (3) the probability of two arrays being equal is small"

All the solutions (to date) focus on part (1), optimizing the performance of each equality check, and some improve that performance by a factor of 10. Points (2) and (3) are ignored. Comparing each pair has O(n^2) complexity, which can become huge for a lot of matrices, and is needless as the probability of any two being duplicates is very small.

The check can become much faster with the following general algorithm:

  • compute a fast hash of each array, O(n)
  • check full equality only for arrays with the same hash

A good hash is almost unique, so the number of distinct keys can easily be a very large fraction of n. On average, the number of arrays sharing a hash will be very small, close to 1 in some cases. Duplicate arrays always have the same hash, while having the same hash doesn't guarantee two arrays are duplicates; in that sense, the algorithm catches all the duplicates. Comparing only arrays with the same hash dramatically reduces the number of comparisons, making the whole check almost O(n).

For my problem, I had to check for duplicates within ~1 million integer arrays, each with 10k elements. Optimizing only the array equality check (with @MB-F's solution), the estimated run time was 5 days. With hashing first it finished in minutes. (I used the array sum as the hash, which suited my arrays' characteristics.)

Some pseudo-Python code:


import numpy as np
from collections import defaultdict


def fast_hash(arr) -> int:
    # Any cheap O(n) reduction works as a hash; the sum suited my arrays.
    return int(arr.sum())

def arrays_equal(arr1, arr2) -> bool:
    # Plug in any equality check here, e.g. the early-terminating numba
    # version from @MB-F's answer; np.array_equal is a simple fallback.
    return np.array_equal(arr1, arr2)

def make_hash_dict(array_stack, hash_fn=np.sum):
    # Map each hash value to the indices of all arrays that produced it.
    hash_dict = defaultdict(list)
    hashes = np.squeeze(np.apply_over_axes(hash_fn, array_stack, list(range(1, array_stack.ndim))))
    for idx, hash_val in enumerate(hashes):
        hash_dict[hash_val].append(idx)

    return hash_dict

def get_duplicate_sets(hash_dict, array_stack):

    duplicate_sets = []
    for hash_key, ind_list in hash_dict.items():
        if len(ind_list) == 1:  # a unique hash cannot have duplicates
            continue

        all_duplicates = []
        for idx1 in range(len(ind_list)):
            v1 = ind_list[idx1]
            if v1 in all_duplicates:
                continue

            arr1 = array_stack[v1]
            curr_duplicates = []
            for idx2 in range(idx1+1, len(ind_list)):
                v2 = ind_list[idx2]
                arr2 = array_stack[v2]
                if arrays_equal(arr1, arr2):
                    if len(curr_duplicates) == 0:
                        curr_duplicates.append(v1)
                    curr_duplicates.append(v2)
            
            if len(curr_duplicates) > 0:
                all_duplicates.extend(curr_duplicates)
                duplicate_sets.append(curr_duplicates)

    return duplicate_sets


The variable duplicate_sets is a list of lists; each inner list contains the indices of one set of mutually identical arrays.
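For illustration, a minimal usage sketch (the stack below and its duplicate layout are made up; any stack of shape (n_arrays, ...) works):

arrays = np.stack([np.arange(5), np.arange(5), np.ones(5)])

hash_dict = make_hash_dict(arrays)            # {10.0: [0, 1], 5.0: [2]}
print(get_duplicate_sets(hash_dict, arrays))  # [[0, 1]]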

Upvotes: 1

auto-dan

Reputation: 9

As Thomas Kühn wrote in a comment to your post, array_equal is a function which should solve the problem. It is described in NumPy's API reference.
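A minimal example (note that, as the question points out, np.array_equal does not short-circuit; it also returns False rather than raising when the shapes differ):

import numpy as np

x = np.array([1, 2, 3])
y = np.array([1, 2, 3])

print(np.array_equal(x, y))                 # True
print(np.array_equal(x, np.array([1, 2])))  # False: shapes differ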

Upvotes: 0

matanox

Reputation: 13746

Well, not really an answer, as I haven't checked whether it short-circuits, but:

np.testing.assert_array_equal.

From the documentation:

Raises an AssertionError if two array_like objects are not equal.

Wrap it in try/except if it's not on a performance-sensitive code path.

Or follow the underlying source code; maybe it's efficient.
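A minimal sketch of that try/except approach (just one way to wrap it, not from the original answer):

import numpy as np
from numpy.testing import assert_array_equal

def arrays_identical(a, b):
    # assert_array_equal raises AssertionError on any mismatch.
    try:
        assert_array_equal(a, b)
        return True
    except AssertionError:
        return False

print(arrays_identical(np.arange(3), np.arange(3)))  # True
print(arrays_identical(np.arange(3), np.ones(3)))    # False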

Upvotes: 1

Śmigło

Reputation: 1047

Hmmm, I know it is a poor answer, but it seems there is no easy way to do this. The Numpy creators should fix it. I suggest:

def compare(a, b):
    # Check progressively longer prefixes so unequal arrays fail fast,
    # before falling back to the full comparison.
    if len(a) > 0 and not np.array_equal(a[0], b[0]):
        return False
    if len(a) > 15 and not np.array_equal(a[:15], b[:15]):
        return False
    if len(a) > 200 and not np.array_equal(a[:200], b[:200]):
        return False
    return np.array_equal(a, b)
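For example (the arrays here are illustrative):

import numpy as np

a = np.arange(1_000_000)
b = a.copy()
b[0] = -1

compare(a, b)         # False after comparing just a[0] and b[0]
compare(a, a.copy())  # True, falls through to the full np.array_equal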

:)

Upvotes: 2

Daniel F

Reputation: 14399

Probably someone who understands the underlying data structure could optimize this or explain whether it's reliable/safe/good practice, but it seems to work.

np.all(a==b)
Out[]: True

memoryview(a.data)==memoryview(b.data)
Out[]: True

%timeit np.all(a==b)
The slowest run took 10.82 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 6.2 µs per loop

%timeit memoryview(a.data)==memoryview(b.data)
The slowest run took 8.55 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.85 µs per loop

If I understand this correctly, ndarray.data creates a pointer to the data buffer and memoryview creates a native Python type that can be short-circuited out of the buffer.

I think.

EDIT: further testing shows it may not be as big a time improvement as shown. The previous test used a = b = np.eye(5); with larger arrays:

a = np.random.randint(0, 10, (100, 100))
b = a.copy()

%timeit np.all(a==b)
The slowest run took 6.70 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 17.7 µs per loop

%timeit memoryview(a.data)==memoryview(b.data)
10000 loops, best of 3: 30.1 µs per loop

np.all(a==b)
Out[]: True

memoryview(a.data)==memoryview(b.data)
Out[]: True

Upvotes: 1

MB-F

Reputation: 23647

Until this is implemented in numpy natively, you can write your own function and jit-compile it with numba:

import numpy as np
import numba as nb


@nb.jit(nopython=True)
def arrays_equal(a, b):
    if a.shape != b.shape:
        return False
    for ai, bi in zip(a.flat, b.flat):
        if ai != bi:
            return False  # early exit on the first mismatch
    return True


a = np.random.rand(10, 20, 30)
b = np.random.rand(10, 20, 30)


%timeit np.all(a==b)  # 100000 loops, best of 3: 9.82 µs per loop
%timeit arrays_equal(a, a)  # 100000 loops, best of 3: 9.89 µs per loop
%timeit arrays_equal(a, b)  # 100000 loops, best of 3: 691 ns per loop

Worst-case performance (arrays equal) is equivalent to np.all, and in the case of early stopping the compiled function has the potential to outperform np.all by a lot.

Upvotes: 13

Anoroah

Reputation: 2119

You could iterate over all elements of the arrays and check whether they are equal. If the arrays are most likely not equal, it will return much faster than np.all. Something like this:

import numpy as np

a = np.array([1, 2, 3])
b = np.array([1, 3, 4])

are_equal = True

# Compare element by element and stop at the first mismatch.
# (range(0, a.size - 1) would skip the last element.)
for x in range(a.size):
    if a[x] != b[x]:
        are_equal = False
        break

if are_equal:
    print("The arrays are equal")
else:
    print("The arrays are not equal")

Upvotes: 0

acdr

Reputation: 4726

Adding short-circuit logic to array comparisons is apparently being discussed on the numpy GitHub page, and will thus presumably be available in a future version of numpy.

Upvotes: 1
