Reputation: 31
I have 2 large, unsorted arrays (structured set of xyz coordinates) and I'm trying to find the positions of all identical subarrays (common points consisting of 3 coordinates). Example:
a = array([[0, 1, 2], [3, 4, 5]])
b = array([[3, 4, 5], [6, 7, 8]])
Here the common subarray would be [3, 4, 5], but more than one identical subarray is possible. The correct indexes would be [0, 1] for a and [1, 0] for b.
I already implemented a pure python method by iterating over all points of one array and comparing them to every point of the other array, but this is extremely slow.
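The slow version is essentially the following double loop (a minimal, self-contained sketch of what I described, with plain lists for brevity):

```python
# Naive O(n*m) comparison: check every point of a against every point of b.
a = [[0, 1, 2], [3, 4, 5]]
b = [[3, 4, 5], [6, 7, 8]]

matches = []  # (index in a, index in b) for each identical point
for i, pa in enumerate(a):
    for j, pb in enumerate(b):
        if pa == pb:
            matches.append((i, j))

print(matches)  # [(1, 0)]
```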
My question: is there an efficient way to find the indexes for both arrays (preferably in numpy, because I need the arrays for further calculations)? Perhaps a rolling_window approach?
Upvotes: 2
Views: 1627
Reputation: 31
I did a little further experimenting and found a numpy specific way to solve this:
import numpy as np
a = np.arange(24).reshape(2,4,3)
b = np.arange(24, 36).reshape(2,2,3)
Array b receives 2 entries from a:
b[1,0] = a[0,1]
b[0,1] = a[1,1]
Finding common entries:
c = np.in1d(a, b).reshape(a.shape)
d = np.in1d(b, a).reshape(b.shape)
Checking where common entries exist in all 3 coordinates:
indexesC = np.where(c[:,:,0] & c[:,:,1] & c[:,:,2])
indexesD = np.where(d[:,:,0] & d[:,:,1] & d[:,:,2])
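A complete run with the arrays above, for verification. One caveat: np.in1d flattens its inputs and tests element-wise membership, so a point whose three coordinates all appear somewhere in the other array, but in different points, would be flagged as a false positive.

```python
import numpy as np

a = np.arange(24).reshape(2, 4, 3)
b = np.arange(24, 36).reshape(2, 2, 3)
b[1, 0] = a[0, 1]  # copy point [3, 4, 5] into b
b[0, 1] = a[1, 1]  # copy point [15, 16, 17] into b

# Element-wise membership tests, reshaped back to the array layouts.
c = np.in1d(a, b).reshape(a.shape)
d = np.in1d(b, a).reshape(b.shape)

# A point counts as common only if all 3 coordinates are found.
indexesC = np.where(c[:, :, 0] & c[:, :, 1] & c[:, :, 2])
indexesD = np.where(d[:, :, 0] & d[:, :, 1] & d[:, :, 2])

print(indexesC)  # (array([0, 1]), array([1, 1]))
print(indexesD)  # (array([0, 1]), array([1, 0]))
```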
Upvotes: 1
Reputation: 1420
Can you map each sub-array to its position index in a hash table? Basically, you change your data structure. After that, in linear time O(n), where n is the size of the bigger hash table, a single loop with O(1) queries into each hash table tells you whether the same sub-array is present in both.
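A minimal sketch of that idea in Python, using a dict as the hash table (the variable names are my own):

```python
a = [[0, 1, 2], [3, 4, 5]]
b = [[3, 4, 5], [6, 7, 8]]

# Map each point (as a hashable tuple) to its index -- one table per array.
table_a = {tuple(p): i for i, p in enumerate(a)}
table_b = {tuple(p): i for i, p in enumerate(b)}

# One pass over one table; each lookup in the other table is O(1) on average.
common = [(i, table_b[p]) for p, i in table_a.items() if p in table_b]
print(common)  # [(1, 0)]
```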
Upvotes: 0
Reputation: 176740
A general solution for Python iterables (not specific to numpy or arrays) that works in linear average time, O(n + m), where n is the number of subarrays and m is the number of unique subarrays:
a = [[0, 1, 2], [3, 4, 5]]
b = [[3, 4, 5], [6, 7, 8]]
from collections import defaultdict
indexmap = defaultdict(list)
for row, sublist in enumerate((a, b)):
    for column, item in enumerate(sublist):
        indexmap[tuple(item)].append((row, column))
repeats = {key: value for key, value in indexmap.items() if len(value) > 1}
Gives
{(3, 4, 5): [(0, 1), (1, 0)]}
If you don't need the double indexes (which list the point came from as well as its position within it), you can simplify the loop to
for sublist in (a, b):
    for column, item in enumerate(sublist):
        indexmap[tuple(item)].append(column)
As a is processed before b, any duplicates get numbered by row automatically:
{(3, 4, 5): [1, 0]}
Here repeats[key][rownum] returns the column index for that row.
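Putting the simplified version together as a runnable whole, with the example lists from the top of the answer:

```python
from collections import defaultdict

a = [[0, 1, 2], [3, 4, 5]]
b = [[3, 4, 5], [6, 7, 8]]

indexmap = defaultdict(list)
for sublist in (a, b):
    for column, item in enumerate(sublist):
        indexmap[tuple(item)].append(column)
repeats = {key: value for key, value in indexmap.items() if len(value) > 1}

key = (3, 4, 5)
print(repeats[key][0])  # position of the point in a -> 1
print(repeats[key][1])  # position of the point in b -> 0
```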
Upvotes: 2