Reputation: 957
I am trying to find matching elements across a large number of large numpy arrays. The way I am doing this currently (holding the arrays in a dictionary and looping over them several times) is very slow, and I am looking for a faster way to tackle the problem.
Here is a self-contained section of code that presents the problem:
import numpy as np

data = {}
for frame in xrange(100):
    data[frame] = np.random.randint(100, size=(100, 3))
# data structure: 100 'pages', each an array of 100 points (rows of 3 values)
# trying to find matching points in arrays on different pages
for page in data:
    for point in data[page]:
        for page_to_match in data:
            if page_to_match == page:  # exclude matches on the same page
                continue
            for point_to_match in data[page_to_match]:
                if np.array_equal(point, point_to_match):
                    print 'Found a match -', point, '- pages:', page, page_to_match
# should find about 0 - 3 matches per page
As you can see it works, but very slowly: the four nested loops compare every point against every point on every other page, which is on the order of (100 × 100)² ≈ 10⁸ calls to np.array_equal.
Edit: here is a minimal version of the code. It runs quickly on small arrays like these but slowly on large arrays like the ones above. Replace the data setup (the three lines after the import) in the section above with the following:
data = {}
data[0] = np.array([[1, 2], [3, 4]])
data[1] = np.array([[5, 6], [7, 8]])
data[2] = np.array([[3, 4], [5, 8]])
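With this data the only cross-page match is the point [3, 4], which appears on both page 0 and page 2.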
Upvotes: 0
Views: 436
Reputation: 96287
If memory isn't an issue, I would solve the problem like this:
>>> data = {}
>>> for frame in range(100):
...     data[frame] = np.random.randint(100, size=(100, 3))
...
>>> from collections import defaultdict
>>> grouper = defaultdict(list)
>>> for k, v in data.items():
...     for row in v:
...         grouper[tuple(row)].append(k)
...
>>> for k, v in grouper.items():
...     if len(v) > 1:
...         print("Duplicate row", np.array(k), "found in pages:")
...         print(*v)
...
Results:
Duplicate row [89 50 83] found in pages:
13 96
Duplicate row [76 88 29] found in pages:
32 56
Duplicate row [74 26 81] found in pages:
11 99
Duplicate row [19 4 53] found in pages:
17 44
Duplicate row [56 20 4] found in pages:
41 91
Duplicate row [92 62 50] found in pages:
6 45
Duplicate row [86 9 39] found in pages:
12 62
Duplicate row [64 47 84] found in pages:
5 51
Duplicate row [52 74 44] found in pages:
9 19
Duplicate row [60 21 39] found in pages:
14 47
Duplicate row [80 42 33] found in pages:
65 82
Duplicate row [ 4 63 85] found in pages:
8 24
Duplicate row [70 84 35] found in pages:
42 69
Duplicate row [54 14 31] found in pages:
43 47
Duplicate row [38 81 38] found in pages:
51 67
Duplicate row [55 59 10] found in pages:
29 54
Duplicate row [84 77 37] found in pages:
51 53
Duplicate row [76 27 54] found in pages:
33 39
Duplicate row [52 64 20] found in pages:
1 37
Duplicate row [65 97 45] found in pages:
61 80
Duplicate row [69 52 8] found in pages:
60 85
Duplicate row [51 2 37] found in pages:
1 52
Duplicate row [31 36 53] found in pages:
50 84
Duplicate row [24 57 1] found in pages:
34 88
Duplicate row [87 89 0] found in pages:
11 78
Duplicate row [94 38 17] found in pages:
40 89
Duplicate row [46 25 46] found in pages:
54 87
Duplicate row [34 15 14] found in pages:
11 92
Duplicate row [ 3 68 1] found in pages:
48 78
Duplicate row [ 9 17 21] found in pages:
21 62
Duplicate row [17 73 66] found in pages:
1 60
Duplicate row [42 15 24] found in pages:
39 78
Duplicate row [90 8 95] found in pages:
41 61
Duplicate row [19 0 51] found in pages:
30 43
Duplicate row [65 99 51] found in pages:
57 81
Duplicate row [ 5 79 61] found in pages:
17 80
Duplicate row [49 74 71] found in pages:
0 57
Duplicate row [77 3 3] found in pages:
18 57
Duplicate row [37 54 66] found in pages:
5 13
Duplicate row [64 64 82] found in pages:
19 23
Duplicate row [64 6 21] found in pages:
27 39
Duplicate row [39 92 82] found in pages:
8 98
Duplicate row [99 10 15] found in pages:
39 68
Duplicate row [45 16 57] found in pages:
8 53
Duplicate row [12 62 98] found in pages:
29 96
Duplicate row [78 73 56] found in pages:
3 79
Duplicate row [62 87 18] found in pages:
84 97
Duplicate row [46 72 87] found in pages:
5 10
Duplicate row [27 75 25] found in pages:
50 57
Duplicate row [92 62 38] found in pages:
0 87
Duplicate row [27 95 86] found in pages:
15 80
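The rows are converted to tuples because numpy arrays are not hashable, so they cannot be used directly as dictionary keys.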
Note: my solution is written in Python 3. In Python 2, use d.iteritems() instead of d.items().
There might be a better solution using numpy, but this is quick and runs in linear rather than quadratic time.
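For instance, here is a minimal sketch of one such numpy approach. It assumes NumPy >= 1.13 (where np.unique gained the axis keyword) and Python 3; the variable names are my own:

import numpy as np
from collections import defaultdict

data = {frame: np.random.randint(100, size=(100, 3)) for frame in range(100)}

# Remember which page each row came from, then stack all pages
# into a single (10000, 3) array.
page_labels = np.repeat(list(data.keys()), [len(v) for v in data.values()])
stacked = np.vstack(list(data.values()))

# np.unique(..., axis=0) groups identical rows in one vectorized pass;
# 'inverse' maps each stacked row to the index of its unique value.
uniq, inverse = np.unique(stacked, axis=0, return_inverse=True)

# Collect the set of pages on which each unique row occurs.
groups = defaultdict(set)
for page, uniq_idx in zip(page_labels, inverse):
    groups[uniq_idx].add(page)

for uniq_idx, pages in groups.items():
    if len(pages) > 1:  # the same row occurs on more than one page
        print("Duplicate row", uniq[uniq_idx], "found in pages:")
        print(*sorted(pages))

The deduplication itself is vectorized; only the final grouping loop runs in Python, once per row.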
Upvotes: 2
Reputation: 1576
Assuming you want to speed up your existing code, it seems very well suited for Cython. Python's for loop has significant overhead, and depending on what you are comparing you may not need rich comparisons, which gives great potential for acceleration. With Cython you can compile your slightly modified Python code, so your loops run as fast as native C loops. Here is a little example:
cdef int i, I
cdef int x[1000]

i, I = 0, 1000
for i in range(1000):
    x[i] = i
When this is compiled, Cython first generates C code from the Cython source; the compiled result then runs much faster than the pure Python equivalent.
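Applied to the question, the comparison kernel might look roughly like the sketch below. The file name match.pyx and the function find_matches are placeholders of mine, and it assumes the arrays use NumPy's default integer dtype, which corresponds to a C long on most platforms:

# match.pyx -- a sketch, not a drop-in implementation

def find_matches(long[:, :] a, long[:, :] b):
    """Return (i, j) index pairs where row a[i] equals row b[j]."""
    cdef Py_ssize_t i, j, k
    cdef bint equal
    matches = []
    for i in range(a.shape[0]):
        for j in range(b.shape[0]):
            equal = True
            for k in range(a.shape[1]):
                if a[i, k] != b[j, k]:
                    equal = False
                    break
            if equal:
                matches.append((i, j))
    return matches

After compiling (for example with cythonize match.pyx), you would call find_matches(data[page], data[page_to_match]) for each pair of pages. This keeps the algorithm quadratic but runs the inner comparisons at C speed; the hashing approach in the answer above is still the better algorithm asymptotically.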
Upvotes: 0