curious_cosmo

Reputation: 1214

Quick way to check for duplicate arrays within list

I have a list (let's call it all_my_arrays) that contains about 48,000 1D arrays. I'd like to know how many duplicate arrays are in this list, if any. However, I'd like to exclude empty arrays, because the list contains several empty arrays and I don't want those counted as duplicates. I tried the code below, but it is taking too long:

import numpy as np
uniques = []
for arr in all_my_arrays:
    if not np.array_equal(np.array([]), arr):  # skip empty arrays
        if not any(np.array_equal(arr, unique_arr) for unique_arr in uniques):
            uniques.append(arr)
print(len(uniques))  # number of non-duplicates

Is there a much quicker way to accomplish this?

Upvotes: 4

Views: 2970

Answers (4)

Daniel F

Reputation: 14399

Just do

in_arr = np.array([i for i in all_my_arrays if i.size == 7])  # keep only the 7-element arrays, i.e. drop the empty ones
uniques = np.unique(in_arr, axis=0)  # unique rows
uniques_list = list(uniques)  # if you really want a list

EDIT: Beware that np.unique sorts internally, so order is not preserved. If you want to maintain order, you'll probably need to build a specialized numba function.
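
If order does matter, one possible workaround (a rough sketch, not part of the original answer) is np.unique's return_index option, which also works with axis=0 and reports where each unique row first appears; sorting those indices restores the input order:

import numpy as np

# Assumes in_arr is the 2D array of non-empty rows built above.
# return_index gives the index of the first occurrence of each unique row,
# so indexing in_arr with those indices (sorted) keeps the original order.
uniques, first_idx = np.unique(in_arr, axis=0, return_index=True)
uniques_in_order = in_arr[np.sort(first_idx)]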

Upvotes: 1

dawg

Reputation: 103694

I am not sure that I fully understand your problem.

However, if I understand correctly:

  1. You have a Python list of 7-element numpy arrays;
  2. You want to filter out the arrays that are all 0;
  3. You then want to get a unique count of what is left.

If that is correct, you can do this:

import numpy as np
li_of_arr = [np.random.choice(4, 7) for _ in range(50000)]  # example list of 7-element arrays
mat = np.array([e for e in li_of_arr if np.any(e)])  # numpy matrix of the non-zero arrays
uniq = np.unique(mat, axis=0)  # the unique sub-arrays
print(len(mat), len(uniq))  # number of non-zero arrays and of unique non-zero arrays

This takes less than 1 second on my MacBook.

Upvotes: 0

Henry Woody

Reputation: 15652

You can use the set type to get the unique values in your list. First you have to convert the arrays to hashable types (tuple here is good). Here's an example:

uniques = set(tuple(arr) for arr in all_my_arrays if arr.size > 0)

The set uniques will contain all the unique, non-empty arrays from your original all_my_arrays list. The contents of uniques are tuples, but you can convert them back to arrays with a list comprehension. If you're only interested in the number of unique arrays, then you can just call len(uniques) and not worry about converting back to arrays.
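
For example, that conversion back might look like this (a small sketch, assuming uniques was built as shown above):

import numpy as np

# Rebuild a 1D NumPy array from each stored tuple.
unique_arrays = [np.array(tup) for tup in uniques]
print(len(unique_arrays))  # number of unique, non-empty arrays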

This approach does O(n*m) work, where n is the number of arrays and m is the length of each, since each array only has to be converted to a tuple and hashed once. There is the overhead of the tuple conversion, but this method should still be much faster than what you have so far (which performs O(n^2) pairwise comparisons, each of cost O(m)), especially for such a large number of arrays.

Edit: To make this a bit faster, you can remove the empty check on each element and then just handle that at the end. Here's what that would look like:

uniques = set(tuple(arr) for arr in all_my_arrays)
num_unique = len(uniques) if () not in uniques else len(uniques) - 1  # discount the empty tuple that stands in for the empty arrays

Upvotes: 2

jpp

Reputation: 164613

Here's a NumPy-based solution which may work for you. For simplicity, I'll use an example where arrays are either of length 3 or 0.

Note that the count of duplicates, as per your definition, equals the total number of arrays minus the number of empty arrays minus the number of unique non-empty arrays; this is worked through for the example below.

The last part may be the most expensive, but we can use np.unique on a regular NumPy array for this:

L = [np.array([1, 2, 3]), np.array([]), np.array([3, 2, 1]),
     np.array([1, 3, 2]), np.array([1, 2, 3]), np.array([1, 3, 2]),
     np.array([]), np.array([]), np.array([1, 2, 3])]

empty_arr = {idx for idx, arr in enumerate(L) if arr.shape == (0,)}  # indices of the empty arrays
empty_count = len(empty_arr)

A = np.array([arr for idx, arr in enumerate(L) if idx not in empty_arr])  # stack the non-empty arrays into a 2D array

unique_arr = np.unique(A, axis=0)  # unique non-empty rows

duplicates = len(L) - empty_count - len(unique_arr)

print(duplicates)  # 3
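
Working the formula through for this example: len(L) is 9, three of the arrays are empty, and there are three unique non-empty arrays ([1, 2, 3], [3, 2, 1] and [1, 3, 2]), so the duplicate count is 9 - 3 - 3 = 3.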

Upvotes: 0
