Reputation: 728
I have two arrays, arr1 and arr2, with shapes (90000, 1) and (120000, 1). I'd like to find out whether any element along axis=0 of arr1 is also present in arr2, record the positions of the matches in a list, and later remove them. This ensures that no element of either array can be found in the other. For now, I'm using for loops:
list_conflict = []
for i in range(len(arr1)):
    for j in range(len(arr2)):
        if arr1[i] == arr2[j]:
            list_conflict.append([i, j])
fault_index_pos = np.unique([x[0] for x in list_conflict])
fault_index_neg = np.unique([x[1] for x in list_conflict])
X_neg = np.delete(X_neg, fault_index_neg, axis=0)
X_pos = np.delete(X_pos, fault_index_pos, axis=0)
The outer loop takes an element of arr1 and compares it exhaustively with every element of arr2. If it finds a match, it appends the indices to list_conflict, with the first element being the position in arr1 and the second the position in arr2. Then fault_index_pos and fault_index_neg are reduced to their unique elements, since an element of arr1 could appear in multiple places in arr2, so the list can contain repeated positions. Finally, the matching elements are removed with np.delete, using the fault_index lists as the indices to delete.
I'm looking for a faster approach to this conflict comparison, whether that is multiprocessing, vectorization, or anything else. You could say it won't take much time, but the actual arrays have shape (x, 8, 10); I shortened them for the sake of clarity.
Upvotes: 0
Views: 703
Reputation: 62503
import numpy as np
import pandas as pd
# create test data
np.random.seed(1)
a = np.random.randint(10, size=(10, 1))
np.random.seed(1)
b = np.random.randint(8, 15, size=(10, 1))
# create dataframe
df_a = pd.DataFrame(a)
df_b = pd.DataFrame(b)
# find unique values in df_a
unique_a = df_a[0].unique().tolist()
# create a Boolean mask and return only values of df_b not found in df_a
values_not_in_a = df_b[~df_b[0].isin(unique_a)].to_numpy()
a = array([[5],
           [8],
           [9],
           [5],
           [0],
           [0],
           [1],
           [7],
           [6],
           [9]])
b = array([[13],
           [11],
           [12],
           [ 8],
           [ 9],
           [11],
           [13],
           [ 8],
           [ 8],
           [ 9]])
# final output array
values_not_in_a = array([[13],
                         [11],
                         [12],
                         [11],
                         [13]])
import numpy as np
# create test data
np.random.seed(1)
a = np.random.randint(10, size=(10, 1))
np.random.seed(1)
b = np.random.randint(8, 15, size=(10, 1))
ua = np.unique(a) # unique values of a
ub = np.unique(b) # unique values of b
mask_b = np.isin(b, ua, invert=True)
mask_a = np.isin(a, ub, invert=True)
b_values_not_in_a = b[mask_b]
a_values_not_in_b = a[mask_a]
# b_values_not_in_a
array([13, 11, 12, 11, 13])
# a_values_not_in_b
array([5, 5, 0, 0, 1, 7, 6])
timeit
# using the following arrays
np.random.seed(1)
a = np.random.randint(10, size=(90000, 1))
np.random.seed(1)
b = np.random.randint(8, 15, size=(120000, 1))
%%timeit
5.6 ms ± 151 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
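Note that indexing with the 2-D masks above returns flattened values. If the goal is to drop the conflicting rows from both arrays, as in the question, the mask can be raveled and applied along axis 0. The following is a minimal sketch, assuming single-column integer arrays; the names keep_a, keep_b, a_clean, and b_clean are illustrative and not from the original answer, and the small arrays are the test data printed above.

```python
import numpy as np

# Small test arrays copied from the printed output above.
a = np.array([[5], [8], [9], [5], [0], [0], [1], [7], [6], [9]])
b = np.array([[13], [11], [12], [8], [9], [11], [13], [8], [8], [9]])

# One boolean per row: True where the row's value does not occur
# anywhere in the other array.
keep_a = ~np.isin(a, b).ravel()
keep_b = ~np.isin(b, a).ravel()

# A 1-D boolean mask selects along axis 0, so the (n, 1) shape is kept.
a_clean = a[keep_a]
b_clean = b[keep_b]

# Cross-check against the nested-loop logic from the question.
loop_bad_a = sorted({i for i in range(len(a))
                     for j in range(len(b)) if a[i, 0] == b[j, 0]})
assert loop_bad_a == np.flatnonzero(~keep_a).tolist()
```

Because both masks are computed from the original arrays before anything is removed, the order of the two deletions does not matter.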
Upvotes: 1
Reputation: 13589
As @Prune suggested, here's a solution that uses sets:
overlap = np.array(list(set(arr1) & set(arr2))) # Depending on array shapes you may need to flatten or slice first
arr1 = arr1[~np.isin(arr1, overlap)]
arr2 = arr2[~np.isin(arr2, overlap)]
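For the (n, 1) arrays in the question, the flattening step mentioned in the comment might look like the sketch below. It assumes single-column arrays of hashable scalars; the sample values and the names keep1, keep2, arr1_clean, and arr2_clean are made up for illustration.

```python
import numpy as np

# Hypothetical sample data with the same (n, 1) layout as in the question.
arr1 = np.array([[5], [8], [9], [5], [0]])
arr2 = np.array([[13], [8], [9], [11]])

# Rows of a 2-D array are not hashable, so flatten to plain Python
# scalars before building the sets.
overlap = np.array(list(set(arr1.ravel().tolist()) & set(arr2.ravel().tolist())))

# Row masks: True where the value is not shared with the other array.
keep1 = ~np.isin(arr1.ravel(), overlap)
keep2 = ~np.isin(arr2.ravel(), overlap)

arr1_clean = arr1[keep1]  # still shape (n_kept, 1)
arr2_clean = arr2[keep2]
```

NumPy's np.intersect1d computes the same overlap without going through Python sets, which may be preferable for large arrays.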
Upvotes: 0
Reputation: 70715
Ignoring the numpy part, finding the conflicting index pairs can be done much faster in pure Python, taking time proportional to len(a) plus len(b) plus the number of conflicts, rather than the nested loops, which take time proportional to the product of the vectors' lengths:
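def conflicts(a, b):
    from collections import defaultdict
    # map each element of a to the list of positions where it occurs
    elt2ix = defaultdict(list)
    for i, elt in enumerate(a):
        elt2ix[elt].append(i)
    # for each element of b, yield an (i, j) pair for every match in a
    for j, elt in enumerate(b):
        if elt in elt2ix:
            for i in elt2ix[elt]:
                yield i, j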
Then, e.g.,
for pair in conflicts([1, 2, 4, 5, 2], [2, 3, 8, 4]):
    print(pair)
displays
(1, 0)
(4, 0)
(2, 3)
which are the indices of the matching occurrences of 2 and 4.
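To connect this back to the question, the pairs could be fed into np.delete in place of list_conflict. This is a rough sketch, assuming (n, 1) arrays whose values become hashable scalars after flattening; the sample arrays reuse the values from the example above, and arr1_clean and arr2_clean are illustrative names.

```python
from collections import defaultdict

import numpy as np


def conflicts(a, b):
    # Same generator as above, repeated so the sketch is self-contained.
    elt2ix = defaultdict(list)
    for i, elt in enumerate(a):
        elt2ix[elt].append(i)
    for j, elt in enumerate(b):
        if elt in elt2ix:
            for i in elt2ix[elt]:
                yield i, j


arr1 = np.array([[1], [2], [4], [5], [2]])
arr2 = np.array([[2], [3], [8], [4]])

# Flatten to plain Python scalars so the elements can be used as dict keys.
pairs = list(conflicts(arr1.ravel().tolist(), arr2.ravel().tolist()))

# Unique row positions to drop from each array.
fault_index_pos = sorted({i for i, _ in pairs})
fault_index_neg = sorted({j for _, j in pairs})

arr1_clean = np.delete(arr1, fault_index_pos, axis=0)
arr2_clean = np.delete(arr2, fault_index_neg, axis=0)
```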
Upvotes: 3
Reputation: 77860
Please work through some tutorials on the vector capabilities of NumPy, as well as the sequence inclusion operators of Python. You are trying to program a large-scale application that sorely needs language facilities you haven't yet learned.
That said, perhaps the fastest way to do this is to convert each to a set and take the set intersection. The operations involved are O(N) for a sequence or set of N elements; your nested loop is O(N*M) in the two sequence sizes.
Any tutorial on Python sets will walk you through this.
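For the (x, 8, 10) arrays mentioned in the question, the set idea could be sketched by hashing each sample's raw bytes. This is only a sketch under assumptions: drop_shared_samples is a hypothetical helper name, both arrays are assumed to share the same dtype and trailing shape, and byte equality is assumed to match value equality (which does not hold for NaN, or for -0.0 versus 0.0, in float data).

```python
import numpy as np


def drop_shared_samples(x_pos, x_neg):
    """Remove every sample (slice along axis 0) that occurs in both arrays."""
    # Assumption: same dtype and trailing shape, so equal samples have
    # identical raw bytes and can serve as hashable set members.
    keys_pos = [sample.tobytes() for sample in x_pos]
    keys_neg = [sample.tobytes() for sample in x_neg]
    shared = set(keys_pos) & set(keys_neg)

    # One boolean per sample: keep it only if it is not shared.
    keep_pos = np.array([k not in shared for k in keys_pos], dtype=bool)
    keep_neg = np.array([k not in shared for k in keys_neg], dtype=bool)
    return x_pos[keep_pos], x_neg[keep_neg]


# Made-up test data with the question's trailing dimensions.
rng = np.random.default_rng(0)
x_pos = rng.integers(0, 3, size=(6, 8, 10))
x_neg = rng.integers(0, 3, size=(5, 8, 10))
x_neg[2] = x_pos[4]  # plant one sample that appears in both arrays

clean_pos, clean_neg = drop_shared_samples(x_pos, x_neg)
```

Building the keys and sets takes time roughly proportional to the total number of samples, in line with the O(N) estimate above.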
Upvotes: 0