Reputation: 957
I am trying to find matching elements across a large number of large numpy arrays. The way I am doing this currently (holding the arrays in a dictionary and looping over them several times) is very slow, and I am looking for a faster way to tackle the problem.
Here is a self-contained section of code that presents the problem:
import numpy as np

data = {}
for frame in xrange(100):
    data[frame] = np.random.randint(100, size=(100, 3))
# data structure: 100 'pages', each an array of 100 points (rows of 3 values)
# trying to find matching points in arrays on different pages
for page in data:
    for point in data[page]:
        for page_to_match in data:
            if page_to_match == page:  # exclude matches on the same page
                continue
            for point_to_match in data[page_to_match]:
                if np.array_equal(point, point_to_match):
                    print 'Found a match -', point, '- pages:', page, page_to_match
# should find about 0 - 3 matches per page
As you can see it works, but very slowly: the four nested loops compare every point against every point on every other page, which is on the order of (100 × 100)² ≈ 10⁸ calls to np.array_equal.
Edit: here is a minimal version of the code. It runs quickly on small arrays like these but slowly on large arrays like the ones above. Replace the data setup (the three lines after the import) in the section above with the following:
data = {}
data[0] = np.array([[1, 2], [3, 4]])
data[1] = np.array([[5, 6], [7, 8]])
data[2] = np.array([[3, 4], [5, 8]])
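With this data the only cross-page match is the point [3, 4], which appears on both page 0 and page 2.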
Upvotes: 0
Views: 436
Reputation: 96287
If memory isn't an issue, I would solve the problem like this:
>>> data = {}
>>> for frame in range(100):
...     data[frame] = np.random.randint(100, size=(100, 3))
...
>>> from collections import defaultdict
>>> grouper = defaultdict(list)
>>> for k, v in data.items():
...     for row in v:
...         grouper[tuple(row)].append(k)
...
>>> for k, v in grouper.items():
...     if len(v) > 1:
...         print("Duplicate row", np.array(k), "found in pages:")
...         print(*v)
...
Results:
Duplicate row [89 50 83] found in pages:
13 96
Duplicate row [76 88 29] found in pages:
32 56
Duplicate row [74 26 81] found in pages:
11 99
Duplicate row [19 4 53] found in pages:
17 44
Duplicate row [56 20 4] found in pages:
41 91
Duplicate row [92 62 50] found in pages:
6 45
Duplicate row [86 9 39] found in pages:
12 62
Duplicate row [64 47 84] found in pages:
5 51
Duplicate row [52 74 44] found in pages:
9 19
Duplicate row [60 21 39] found in pages:
14 47
Duplicate row [80 42 33] found in pages:
65 82
Duplicate row [ 4 63 85] found in pages:
8 24
Duplicate row [70 84 35] found in pages:
42 69
Duplicate row [54 14 31] found in pages:
43 47
Duplicate row [38 81 38] found in pages:
51 67
Duplicate row [55 59 10] found in pages:
29 54
Duplicate row [84 77 37] found in pages:
51 53
Duplicate row [76 27 54] found in pages:
33 39
Duplicate row [52 64 20] found in pages:
1 37
Duplicate row [65 97 45] found in pages:
61 80
Duplicate row [69 52 8] found in pages:
60 85
Duplicate row [51 2 37] found in pages:
1 52
Duplicate row [31 36 53] found in pages:
50 84
Duplicate row [24 57 1] found in pages:
34 88
Duplicate row [87 89 0] found in pages:
11 78
Duplicate row [94 38 17] found in pages:
40 89
Duplicate row [46 25 46] found in pages:
54 87
Duplicate row [34 15 14] found in pages:
11 92
Duplicate row [ 3 68 1] found in pages:
48 78
Duplicate row [ 9 17 21] found in pages:
21 62
Duplicate row [17 73 66] found in pages:
1 60
Duplicate row [42 15 24] found in pages:
39 78
Duplicate row [90 8 95] found in pages:
41 61
Duplicate row [19 0 51] found in pages:
30 43
Duplicate row [65 99 51] found in pages:
57 81
Duplicate row [ 5 79 61] found in pages:
17 80
Duplicate row [49 74 71] found in pages:
0 57
Duplicate row [77 3 3] found in pages:
18 57
Duplicate row [37 54 66] found in pages:
5 13
Duplicate row [64 64 82] found in pages:
19 23
Duplicate row [64 6 21] found in pages:
27 39
Duplicate row [39 92 82] found in pages:
8 98
Duplicate row [99 10 15] found in pages:
39 68
Duplicate row [45 16 57] found in pages:
8 53
Duplicate row [12 62 98] found in pages:
29 96
Duplicate row [78 73 56] found in pages:
3 79
Duplicate row [62 87 18] found in pages:
84 97
Duplicate row [46 72 87] found in pages:
5 10
Duplicate row [27 75 25] found in pages:
50 57
Duplicate row [92 62 38] found in pages:
0 87
Duplicate row [27 95 86] found in pages:
15 80
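The rows are converted to tuples because numpy arrays are not hashable, so they cannot be used directly as dictionary keys.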
Note: my solution is written in Python 3. In Python 2, use d.iteritems() instead of d.items().
There might be a better solution using numpy, but this is quick and runs in linear rather than quadratic time.
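For instance, here is a minimal sketch of one such numpy approach. It assumes NumPy >= 1.13 (where np.unique gained the axis keyword) and Python 3; the variable names are my own:

import numpy as np
from collections import defaultdict

data = {frame: np.random.randint(100, size=(100, 3)) for frame in range(100)}

# Remember which page each row came from, then stack all pages
# into a single (10000, 3) array.
page_labels = np.repeat(list(data.keys()), [len(v) for v in data.values()])
stacked = np.vstack(list(data.values()))

# np.unique(..., axis=0) groups identical rows in one vectorized pass;
# 'inverse' maps each stacked row to the index of its unique value.
uniq, inverse = np.unique(stacked, axis=0, return_inverse=True)

# Collect the set of pages on which each unique row occurs.
groups = defaultdict(set)
for page, uniq_idx in zip(page_labels, inverse):
    groups[uniq_idx].add(page)

for uniq_idx, pages in groups.items():
    if len(pages) > 1:  # the same row occurs on more than one page
        print("Duplicate row", uniq[uniq_idx], "found in pages:")
        print(*sorted(pages))

The deduplication itself is vectorized; only the final grouping loop runs in Python, once per row.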
Upvotes: 2
Reputation: 1576
Assuming you want to speed up your existing code, it seems very well suited for Cython. Python's for loop has significant overhead, and depending on what you are comparing you may not need rich comparisons, which gives great potential for acceleration. With Cython you can compile your slightly modified Python code, so your loops run as fast as native C loops. Here is a little example:
cdef int i, I
cdef int x[1000]

i, I = 0, 1000
for i in range(1000):
    x[i] = i
When this is compiled, Cython first generates C code from the Cython source; the compiled result then runs much faster than the pure Python equivalent.
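Applied to the question, the comparison kernel might look roughly like the sketch below. The file name match.pyx and the function find_matches are placeholders of mine, and it assumes the arrays use NumPy's default integer dtype, which corresponds to a C long on most platforms:

# match.pyx -- a sketch, not a drop-in implementation

def find_matches(long[:, :] a, long[:, :] b):
    """Return (i, j) index pairs where row a[i] equals row b[j]."""
    cdef Py_ssize_t i, j, k
    cdef bint equal
    matches = []
    for i in range(a.shape[0]):
        for j in range(b.shape[0]):
            equal = True
            for k in range(a.shape[1]):
                if a[i, k] != b[j, k]:
                    equal = False
                    break
            if equal:
                matches.append((i, j))
    return matches

After compiling (for example with cythonize match.pyx), you would call find_matches(data[page], data[page_to_match]) for each pair of pages. This keeps the algorithm quadratic but runs the inner comparisons at C speed; the hashing approach in the answer above is still the better algorithm asymptotically.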
Upvotes: 0