Reputation: 65
Well, I have a simple problem that is giving me a headache, basically I have two two-dimensional arrays, full of [x,y] coordinates, and I want to compare the first with the second and generate a third array that contains all the elements of the first array that doesn't appear in the second. It's simple but I couldn't make it work at all. The size varies a lot, the first array can have between a thousand and 2 million coordinates, while the first array has between 1 and a thousand.
This operation will occur many times and the larger the first array, the more times it will occur
sample:
arr1 = np.array([[0, 3], [0, 4], [1, 3], [1, 7], ])
arr2 = np.array([[0, 3], [1, 7]])
result = np.array([[0, 4], [1, 3]])
In Depth: Basically I have a binary image with variable resolution, it is composed of 0 and 1 (255) and I analyze each pixel individually (with an algorithm that is already optimized), but (on purpose) every time this function is executed it analyzes only a fraction of the pixels, and when it is finished it gives me back all the coordinates of these pixels. The problem is that when it executes it runs the following code:
ones = np.argwhere(img == 255) # ones = pixels array
It takes about 0.02 seconds and is by far the slowest part of the code. My idea is to create this variable once and, each time the function ends, it removes the parsed pixels and passes the new array as parameter to continue until the array is empty
Upvotes: 4
Views: 7342
Reputation: 26886
Not sure what you intend to do with the extra dimensions, as the set difference, like any filtering, is inherently losing the shape information.
Anyway, NumPy does provide np.setdiff1d()
to solve this problem elegantly.
EDIT With the clarifications provided, you seems to be looking for a way compute the set difference on a given axis, i.e. the elements of the sets are actually arrays.
There is no built-in specifically for this in NumPy, but it is not too difficult to craft one.
For simplicity, we assume that the operating axis is the first one (so that the element of the set are arr[i]
), that only unique elements appear in the first array, and that the arrays are 2D.
They are all based on the idea that the asymptotically best approach is to build a set()
of the second array and then using that to filter out the entries from the first array.
The idiomatic way to build such set in Python / NumPy is to use:
set(map(tuple, arr))
where the mapping to tuple
freezes arr[i]
, allowing them to be hashable and hence making them available to use with set()
.
Unfortunately, since the filtering would produce results of unpredictable size, NumPy arrays are not the ideal container for the result.
To solve this issue, one can use:
list
import numpy as np
def setdiff2d_list(arr1, arr2):
delta = set(map(tuple, arr2))
return np.array([x for x in arr1 if tuple(x) not in delta])
np.fromiter()
followed by np.reshape()
import numpy as np
def setdiff2d_iter(arr1, arr2):
delta = set(map(tuple, arr2))
return np.fromiter((x for xs in arr1 if tuple(xs) not in delta for x in xs), dtype=arr1.dtype).reshape(-1, arr1.shape[-1])
def setdiff2d_idx(arr1, arr2):
delta = set(map(tuple, arr2))
idx = [tuple(x) not in delta for x in arr1]
return arr1[idx]
set()
(will force uniqueness of the output elements and will lose ordering):import numpy as np
def setdiff2d_set(arr1, arr2):
set1 = set(map(tuple, arr1))
set2 = set(map(tuple, arr2))
return np.array(list(set1 - set2))
Alternatively, the advanced indexing can be built using broadcasting, np.any()
and np.all()
:
def setdiff2d_bc(arr1, arr2):
idx = (arr1[:, None] != arr2).any(-1).all(1)
return arr1[idx]
Some form of the above methods were originally suggested in @QuangHoang's answer.
A similar approach could also be implemented in Numba, following the same idea as above but using a hash instead of the actual array view arr[i]
(because of the limitations in what is supported inside a set()
by Numba) and pre-computing the output size (for speed):
import numpy as np
import numba as nb
@nb.njit
def mul_xor_hash(arr, init=65537, k=37):
result = init
for x in arr.view(np.uint64):
result = (result * k) ^ x
return result
@nb.njit
def setdiff2d_nb(arr1, arr2):
# : build `delta` set using hashes
delta = {mul_xor_hash(arr2[0])}
for i in range(1, arr2.shape[0]):
delta.add(mul_xor_hash(arr2[i]))
# : compute the size of the result
n = 0
for i in range(arr1.shape[0]):
if mul_xor_hash(arr1[i]) not in delta:
n += 1
# : build the result
result = np.empty((n, arr1.shape[-1]), dtype=arr1.dtype)
j = 0
for i in range(arr1.shape[0]):
if mul_xor_hash(arr1[i]) not in delta:
result[j] = arr1[i]
j += 1
return result
While they all give the same result:
funcs = setdiff2d_iter, setdiff2d_list, setdiff2d_idx, setdiff2d_set, setdiff2d_bc, setdiff2d_nb
arr1 = np.array([[0, 3], [0, 4], [1, 3], [1, 7]])
print(arr1)
# [[0 3]
# [0 4]
# [1 3]
# [1 7]]
arr2 = np.array([[0, 3], [1, 7], [4, 0]])
print(arr2)
# [[0 3]
# [1 7]
# [4 0]]
result = funcs[0](arr1, arr2)
print(result)
# [[0 4]
# [1 3]]
for func in funcs:
print(f'{func.__name__:>24s}', np.all(result == func(arr1, arr2)))
# setdiff2d_iter True
# setdiff2d_list True
# setdiff2d_idx True
# setdiff2d_set False # because of ordering
# setdiff2d_bc True
# setdiff2d_nb True
their performance seems to be varying:
for func in funcs:
print(f'{func.__name__:>24s}', end=' ')
%timeit func(arr1, arr2)
# setdiff2d_iter 16.3 µs ± 719 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# setdiff2d_list 14.9 µs ± 528 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# setdiff2d_idx 17.8 µs ± 1.75 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# setdiff2d_set 17.5 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# setdiff2d_bc 9.45 µs ± 405 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# setdiff2d_nb 1.58 µs ± 51.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
The Numba-based approach proposed seems to outperform the other approaches by a fair margin (some 10x using the given input).
Similar timings are observed with larger inputs:
np.random.seed(42)
arr1 = np.random.randint(0, 100, (1000, 2))
arr2 = np.random.randint(0, 100, (1000, 2))
print(setdiff2d_nb(arr1, arr2).shape)
# (736, 2)
for func in funcs:
print(f'{func.__name__:>24s}', end=' ')
%timeit func(arr1, arr2)
# setdiff2d_iter 3.51 ms ± 75.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# setdiff2d_list 2.92 ms ± 32.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# setdiff2d_idx 2.61 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# setdiff2d_set 3.52 ms ± 67.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# setdiff2d_bc 25.6 ms ± 198 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# setdiff2d_nb 192 µs ± 1.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
(As a side note, setdiff2d_bc()
is the most negatively affected by the size of the second input).
Upvotes: 4
Reputation: 150735
Depending on how large your arrays are. If they are not too large (few thousands), you can
x
to each point in y
any
to check for inequality at the last dimensionall
to check for matchingCode:
idx = (arr1[:,None]!=arr2).any(-1).all(1)
arr1[idx]
Output:
array([[0, 4],
[1, 3]])
update: for longer data, you can try set and a for
loop:
set_y = set(map(tuple, y))
idx = [tuple(point) not in set_y for point in x]
x[idx]
Upvotes: 1