Reputation: 91
Similar to retrun_inverse
in numpy.unique
,
if I have a numpy array a:[['a' 'b'] ['b' 'c'] ['c' 'c'] ['c' 'b']]
,
I want to convert array b:[['b' 'c'] ['a' 'b'] ['c' 'c'] ['a' 'b'] ['c' 'c']]
to [1 0 2 0 2]
.
Is there any smart way to convert it?
Upvotes: 0
Views: 718
Reputation: 26946
Perhaps this is more easily done with plain list
s (which you can get from NumPy arrays with .tolist()
method):
a = [['a', 'b'], ['b', 'c'], ['c', 'c'], ['c', 'b']]
b = [['b', 'c'], ['a', 'b'], ['c', 'c'], ['a', 'b'], ['c', 'c']]
print([a.index(x) for x in b])
# [1, 0, 2, 0, 2]
or, writing it as a function and assuming NumPy arrays inputs and outputs and handling the case where a needle is not in the haystack:
import numpy as np
def find_by_list(haystack, needles):
haystack = haystack.tolist()
result = []
for needle in needles.tolist():
try:
result.append(haystack.index(needle))
except ValueError:
result.append(-1)
return np.array(result)
This comes out approximately as fast as a more NumPy-aware solutions based on np.where()
(provided that the reduction operated by np.all()
can be done on the first axis), e.g.:
import numpy as np
def find_by_np(haystack, needles, haystack_axis=-1, needles_axis=-1, keepdims=False):
if haystack_axis:
haystack = haystack.swapaxes(0, haystack_axis)
if needles_axis:
needles = needles.swapaxes(0, needles_axis)
n = needles.shape[0]
m = haystack.ndim - 1
shape = haystack.shape[1:]
result = np.full((m,) + needles.shape[1:], -1)
haystack = haystack.reshape(n, -1)
needles = needles.reshape(n, -1)
_, match, index = np.nonzero(np.all(
haystack[:, None, :] == needles[:, :, None],
axis=0, keepdims=True))
result.reshape(m, -1)[:, match] = np.unravel_index(index, shape)
if not keepdims and result.shape[0] == 1:
result = np.squeeze(result, 0)
return result
but both are slower than simple loops accelerated with Numba JIT, e.g.:
import numpy as np
import numba as nb
def find_by_loop(haystack, needles):
n, m = haystack.shape
l, m_ = needles.shape
result = np.full(l, -1)
if m != m_:
return result
for i in range(l):
for j in range(n):
is_equal = True
for k in range(m):
if haystack[j, k] != needles[i, k]:
is_equal = False
break
if is_equal:
break
if is_equal:
result[i] = j
return result
find_by_nb = nb.jit(find_by_loop)
find_by_nb.__name__ = 'find_by_nb'
While they all give the same result:
funcs = find_by_list, find_by_np, find_by_loop, find_by_nb
a = np.array([['a', 'b'], ['b', 'c'], ['c', 'c'], ['c', 'b']])
b = np.array([['b', 'c'], ['a', 'b'], ['c', 'c'], ['a', 'b'], ['c', 'c']])
print(a.shape, b.shape)
for func in funcs:
print(f'{func.__name__:>12s}(a, b) = {func(a, b)}')
# find_by_list(a, b) = [1 0 2 0 2]
# find_by_np(a, b) = [1 0 2 0 2]
# find_by_loop(a, b) = [1 0 2 0 2]
# find_by_nb(a, b) = [1 0 2 0 2]
The timings comes as follows:
print(f'({"n":<4s}, {"m":<4s}, {"k":<4s})', end=' ')
for func in funcs:
print(f'{func.__name__:>15s}', end=' ')
print()
for n, m, k in itertools.product((5, 50, 500), repeat=3):
a = np.random.randint(0, 100, (k, n))
b = np.random.randint(0, 100, (m, n))
print(f'({n:<4d}, {m:<4d}, {k:<4d})', end=' ')
for func in funcs:
result = %timeit -n3 -r10 -q -o func(a, b)
print(f'{result.best * 1e3:12.3f} ms', end=' ')
print()
# (n , m , k ) find_by_list find_by_np find_by_loop find_by_nb
# (5 , 5 , 5 ) 0.008 ms 0.048 ms 0.021 ms 0.001 ms
# (5 , 5 , 50 ) 0.018 ms 0.031 ms 0.176 ms 0.001 ms
# (5 , 5 , 500 ) 0.132 ms 0.092 ms 1.754 ms 0.006 ms
# (5 , 50 , 5 ) 0.065 ms 0.031 ms 0.184 ms 0.001 ms
# (5 , 50 , 50 ) 0.139 ms 0.093 ms 1.756 ms 0.006 ms
# (5 , 50 , 500 ) 1.096 ms 0.684 ms 17.546 ms 0.049 ms
# (5 , 500 , 5 ) 0.658 ms 0.093 ms 1.871 ms 0.006 ms
# (5 , 500 , 50 ) 1.383 ms 0.699 ms 17.504 ms 0.051 ms
# (5 , 500 , 500 ) 9.102 ms 7.752 ms 177.754 ms 0.491 ms
# (50 , 5 , 5 ) 0.026 ms 0.061 ms 0.022 ms 0.001 ms
# (50 , 5 , 50 ) 0.054 ms 0.042 ms 0.174 ms 0.002 ms
# (50 , 5 , 500 ) 0.356 ms 0.203 ms 1.759 ms 0.006 ms
# (50 , 50 , 5 ) 0.232 ms 0.042 ms 0.185 ms 0.001 ms
# (50 , 50 , 50 ) 0.331 ms 0.205 ms 1.744 ms 0.006 ms
# (50 , 50 , 500 ) 1.332 ms 2.422 ms 17.492 ms 0.051 ms
# (50 , 500 , 5 ) 2.328 ms 0.197 ms 1.882 ms 0.006 ms
# (50 , 500 , 50 ) 3.092 ms 2.405 ms 17.618 ms 0.052 ms
# (50 , 500 , 500 ) 11.088 ms 18.989 ms 175.568 ms 0.479 ms
# (500 , 5 , 5 ) 0.205 ms 0.035 ms 0.023 ms 0.001 ms
# (500 , 5 , 50 ) 0.410 ms 0.137 ms 0.187 ms 0.001 ms
# (500 , 5 , 500 ) 2.800 ms 1.914 ms 1.894 ms 0.006 ms
# (500 , 50 , 5 ) 1.868 ms 0.138 ms 0.201 ms 0.001 ms
# (500 , 50 , 50 ) 2.154 ms 1.814 ms 1.902 ms 0.006 ms
# (500 , 50 , 500 ) 6.352 ms 16.343 ms 19.108 ms 0.050 ms
# (500 , 500 , 5 ) 19.798 ms 1.957 ms 2.020 ms 0.006 ms
# (500 , 500 , 50 ) 20.922 ms 13.571 ms 18.850 ms 0.052 ms
# (500 , 500 , 500 ) 35.947 ms 139.923 ms 189.747 ms 0.481 ms
Showing that Numba provides the fastest (and most memory efficient) solution, while its non-JIT accelerated version provides the slowest. The NumPy-based one and the list
-based one come out somewhere in between, with varying speed. But for larger inputs, the list
based one should be faster on average because it provides better short-circuiting.
Upvotes: 1
Reputation: 1744
An interesting way to do things, perhaps?
a.append(None)
aa = np.array(a)[:-1] # Note 1
b.append(None)
bb = np.array(b)[:-1]
ind_arr = bb[:, None] == aa # Note 2
np.nonzero(ind_arr)[1]
Note 1: The first step is more like an overhead to get an object
type 1-D array. Otherwise, numpy
forces a 2-D str
type array, which is unhelpful for this application. Read more about it in this answer. It also states a few alternatives.
Note 2: This creates a 2-D boolean mask with each element of aa
compared with each element of bb
for equality, like so: ind_arr[i, j] = (bb[i] == aa[j])
.
The next line uses this mask and extracts the True
values along axis 1 (where the comparison had evaluated to True
). This is because aa
values in the comparison mask are along axis 1.
Another discussion to understand this better.
If you're looking for speed though, for lists, norok2's answer is quite a lot faster. This perhaps, can have innovative applications. Cheers!
Upvotes: 0
Reputation: 2905
Not the most elegant solution, but it works:
Setting up (in the future, show the code to generate your example, it'll make it faster to answer):
import numpy as np
a = np.array([['a', 'b'], ['b', 'c'], ['c', 'c'], ['c', 'b']])
b = np.array([['b', 'c'], ['a', 'b'], ['c', 'c'], ['a', 'b'], ['c', 'c']])
desired_output = [1, 0, 2, 0, 2]
Using the numpy.where function (as in this related question: Is there a NumPy function to return the first index of something in an array?)
We use np.where
per item in each row, multiply the boolean results, and use list comprehension to pass row by row:
output = [np.where((x[0]==a[:,0]) * (x[1]==a[:,1]))[0][0] for x in b]
It gives back your desired result.
Upvotes: 0