belenon

Reputation: 31

finding indexes of common subarrays

I have 2 large, unsorted arrays (structured set of xyz coordinates) and I'm trying to find the positions of all identical subarrays (common points consisting of 3 coordinates). Example:

from numpy import array

a = array([[0, 1, 2], [3, 4, 5]])
b = array([[3, 4, 5], [6, 7, 8]])

Here the common subarray would be [3, 4, 5], but more than one identical subarray is possible. The correct indexes would be [0,1] for a and [1,0] for b.

I already implemented a pure Python method that iterates over all points of one array and compares them to every point of the other array, but this is extremely slow.

My question is, is there an efficient way to find the indexes for both arrays (preferably in numpy, because I need the arrays for further calculations)? Perhaps a rolling_window approach?

Upvotes: 2

Views: 1627

Answers (3)

belenon

Reputation: 31

I did a little further experimenting and found a numpy-specific way to solve this:

import numpy as np

a = np.arange(24).reshape(2,4,3)
b = np.arange(24, 36).reshape(2,2,3)

Array b receives 2 entries from a:

b[1,0] = a[0,1]
b[0,1] = a[1,1]

Finding common entries:

c = np.in1d(a, b).reshape(a.shape)
d = np.in1d(b, a).reshape(b.shape)

Checking where common entries exist in all 3 coordinates (note that in1d tests each coordinate individually, so a point whose x, y and z values all appear somewhere in the other array is flagged even if they belong to different points there):

indexesC = np.where(c[:,:,0] & c[:,:,1] & c[:,:,2])
indexesD = np.where(d[:,:,0] & d[:,:,1] & d[:,:,2])
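Put together, a minimal runnable version of the steps above (using np.isin, the modern replacement for np.in1d, which keeps the input's shape so no reshape is needed; the printed indexes are for the example arrays defined here):

```python
import numpy as np

a = np.arange(24).reshape(2, 4, 3)
b = np.arange(24, 36).reshape(2, 2, 3)

# copy two points from a into b so the arrays share common entries
b[1, 0] = a[0, 1]
b[0, 1] = a[1, 1]

# per-coordinate membership test; np.isin preserves the input shape
c = np.isin(a, b)
d = np.isin(b, a)

# a point counts as common only if all three coordinates matched
indexesC = np.where(c[:, :, 0] & c[:, :, 1] & c[:, :, 2])
indexesD = np.where(d[:, :, 0] & d[:, :, 1] & d[:, :, 2])

print(indexesC)  # (array([0, 1]), array([1, 1]))
print(indexesD)  # (array([0, 1]), array([1, 0]))
```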

Upvotes: 1

D_K

Reputation: 1420

Can you map each sub-array to its position index in a hash table? Basically, you change your data structure. After that, in linear time O(n), where n is the size of the larger hash table, a single loop with O(1) lookups per query tells you whether the same sub-array is present in two or more hash tables.
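A minimal sketch of this idea in Python (the table names are illustrative, not part of the answer; the dict comprehension keeps only the last index if a point appears more than once in the same array):

```python
import numpy as np

a = np.array([[0, 1, 2], [3, 4, 5]])
b = np.array([[3, 4, 5], [6, 7, 8]])

# one hash table per array: point (as a hashable tuple) -> its index
table_a = {tuple(point): i for i, point in enumerate(a)}
table_b = {tuple(point): i for i, point in enumerate(b)}

# one linear pass; each membership test and lookup is O(1) on average
common = [(table_a[p], table_b[p]) for p in table_a if p in table_b]

print(common)  # [(1, 0)]
```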

Upvotes: 0

agf

Reputation: 176740

A general solution for Python iterables (not specific to numpy or arrays) that works in linear average time (O(n + m), where n is the number of subarrays and m is the number of unique subarrays):

a = [[0, 1, 2], [3, 4, 5]]
b = [[3, 4, 5], [6, 7, 8]]

from collections import defaultdict

indexmap = defaultdict(list)

for row, sublist in enumerate((a, b)):
    for column, item in enumerate(sublist):
        indexmap[tuple(item)].append((row, column))

repeats = {key: value for key, value in indexmap.items() if len(value) > 1}

Gives

{(3, 4, 5): [(0, 1), (1, 0)]}

If you don't need the double-row-indexes (index in the list and in the stored index) you can simplify the loop to

for row in (a, b):
    for column, item in enumerate(row):
        indexmap[tuple(item)].append(column)

As a will be processed before b, any duplicates will get numbered by row automatically:

{(3, 4, 5): [1, 0]}

With repeats[key][rownum] returning the column index for that row.
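For example, the simplified variant runs end to end like this (same data as above; the indexes into repeats[key] correspond to row 0 = a, row 1 = b):

```python
from collections import defaultdict

a = [[0, 1, 2], [3, 4, 5]]
b = [[3, 4, 5], [6, 7, 8]]

indexmap = defaultdict(list)
for row in (a, b):
    for column, item in enumerate(row):
        indexmap[tuple(item)].append(column)

repeats = {key: value for key, value in indexmap.items() if len(value) > 1}

print(repeats)                # {(3, 4, 5): [1, 0]}
print(repeats[(3, 4, 5)][0])  # 1 -> column index in a (row 0)
print(repeats[(3, 4, 5)][1])  # 0 -> column index in b (row 1)
```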

Upvotes: 2
