Reputation: 1214
I have a list (let's call it all_my_arrays) that contains about 48,000 1D arrays. I'd like to know how many duplicate arrays are in this list, if any. However, I'd like to exclude empty arrays (I have multiple empty arrays in the list and don't want those factored into my duplicate count). I tried the code below, but it is taking too long:
import numpy as np

uniques = []
for arr in all_my_arrays:
    if not np.array_equal(np.array([]), arr):
        if not any(np.array_equal(arr, unique_arr) for unique_arr in uniques):
            uniques.append(arr)
print(len(uniques))  # number of non-duplicates
Is there a much quicker way to accomplish this?
Upvotes: 4
Views: 2970
Reputation: 14399
Just do
in_arr = np.array([i for i in all_my_arrays if i.size == 7])  # keep only the non-empty (length-7) arrays
uniques = np.unique(in_arr, axis=0)
uniques_list = list(uniques)  # if you really want a list
EDIT: Beware that np.unique sorts internally, so order is not preserved. If you want to maintain order, you'll probably need to build a specialized numba function.
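For completeness, a lighter-weight sketch of preserving order without numba, using np.unique's return_index output (shown on a hypothetical three-array list standing in for all_my_arrays):

```python
import numpy as np

# Hypothetical small input standing in for all_my_arrays (all length 7 here).
all_my_arrays = [np.array([1, 2, 3, 4, 5, 6, 7]),
                 np.array([7, 6, 5, 4, 3, 2, 1]),
                 np.array([1, 2, 3, 4, 5, 6, 7])]

in_arr = np.array(all_my_arrays)

# return_index gives the first position of each unique row;
# re-sorting by those positions restores the original order.
_, first_idx = np.unique(in_arr, axis=0, return_index=True)
ordered_uniques = in_arr[np.sort(first_idx)]

print(ordered_uniques)  # rows in first-seen order
```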
Upvotes: 1
Reputation: 103694
I am not sure that I am understanding your problem fully. However, if I understand correctly, you want to count the unique non-empty arrays. If that is correct, you can do this:
import numpy as np
li_of_arr = [np.random.choice(4, 7) for _ in range(50000)]  # example list of arrays
mat = np.array([e for e in li_of_arr if np.any(e)])  # keep arrays with at least one non-zero element (this also drops empty arrays)
uniq = np.unique(mat, axis=0)  # here are your unique sub-arrays
print(len(mat), len(uniq))  # number of non-zero arrays and of unique non-zero arrays
This takes less than 1 second on my MacBook.
Upvotes: 0
Reputation: 15652
You can use the set type to get the unique values in your list. First you have to convert the arrays to a hashable type (tuple works well here). Here's an example:
uniques = set(tuple(arr) for arr in all_my_arrays if arr.size > 0)
The set uniques will contain all the unique, non-empty arrays from your original all_my_arrays list. The contents of uniques are tuples, but you can convert them back to arrays with a list comprehension. If you're only interested in the number of unique arrays, you can just call len(uniques) and not worry about converting back to arrays.
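A minimal sketch of the round trip, using a hypothetical four-array list in place of all_my_arrays:

```python
import numpy as np

# Hypothetical small list standing in for all_my_arrays.
all_my_arrays = [np.array([1, 2]), np.array([]),
                 np.array([1, 2]), np.array([3, 4])]

# Tuples are hashable, so the set deduplicates; empty arrays are skipped.
uniques = set(tuple(arr) for arr in all_my_arrays if arr.size > 0)
print(len(uniques))  # 2

# Converting back to arrays if you need them:
unique_arrays = [np.array(t) for t in uniques]
```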
This approach has time complexity O(n*m), where n is the number of arrays and m is the length of each (each array is converted to a tuple and hashed once). There is some overhead in the tuple conversion, but this should still be much faster than what you have so far (which has time complexity O(n^2*m), since it compares every array against every array seen so far), especially for such a large number of arrays.
Edit: To make this a bit faster, you can remove the empty check on each element and then just handle that at the end. Here's what that would look like:
uniques = set(tuple(arr) for arr in all_my_arrays)
num_unique = len(uniques) if () not in uniques else len(uniques) - 1
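A quick sketch of this variant on a hypothetical small list; the empty arrays all collapse to the tuple (), so a single subtraction at the end removes them from the count:

```python
import numpy as np

# Hypothetical small list with empty arrays mixed in.
all_my_arrays = [np.array([1, 2]), np.array([]),
                 np.array([1, 2]), np.array([])]

uniques = set(tuple(arr) for arr in all_my_arrays)  # empty arrays become ()
num_unique = len(uniques) if () not in uniques else len(uniques) - 1
print(num_unique)  # 1
```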
Upvotes: 2
Reputation: 164613
Here's a NumPy-based solution which may work for you. For simplicity, I'll use an example where arrays are either of length 3 or 0.
Note that the count of duplicates, per your definition, equals the total number of arrays, minus the number of empty arrays, minus the number of unique non-empty arrays. The last part may be the most expensive, but we can use np.unique on a regular NumPy array for it:
import numpy as np

L = [np.array([1, 2, 3]), np.array([]), np.array([3, 2, 1]),
     np.array([1, 3, 2]), np.array([1, 2, 3]), np.array([1, 3, 2]),
     np.array([]), np.array([]), np.array([1, 2, 3])]

# indices of the empty arrays
empty_arr = {idx for idx, arr in enumerate(L) if arr.shape == (0,)}

# stack the non-empty arrays into a 2D array and deduplicate its rows
A = np.array([arr for idx, arr in enumerate(L) if idx not in empty_arr])
unique_arr = np.unique(A, axis=0)

duplicates = len(L) - len(empty_arr) - len(unique_arr)
print(duplicates)  # 3
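As an aside on the same example list L, np.unique's return_counts=True reports how often each unique row occurs, which gives the same duplicate total directly as the sum of (count - 1) over the unique rows:

```python
import numpy as np

L = [np.array([1, 2, 3]), np.array([]), np.array([3, 2, 1]),
     np.array([1, 3, 2]), np.array([1, 2, 3]), np.array([1, 3, 2]),
     np.array([]), np.array([]), np.array([1, 2, 3])]

# Drop the empty arrays, then stack the rest into a 2D array.
A = np.array([arr for arr in L if arr.shape != (0,)])

# Each row beyond the first occurrence of a unique value is a duplicate.
_, counts = np.unique(A, axis=0, return_counts=True)
print((counts - 1).sum())  # 3
```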
Upvotes: 0