user3357979

Reputation: 627

Check that all rows in numpy array are unique

I have a 4 column array:

A=array([[100,1,500,1],
         [100,1,501,1],
         [101,1,501,1],
         [102,2,502,2],
         [500,1,100,1],
         [100,1,500,1],
         [502,2,102,2],
         [502,1,102,1]])

I want to extract the rows that are unique (keeping the first occurrence of any duplicates), and among those, keep only the rows i for which there is no earlier row j with A[i,:] == A[j,[2,1,0,3]], i.e. row j with columns 0 and 2 swapped.

So for array A, I would like to get an array that looks like:

B=array([[100,1,500,1],
         [100,1,501,1],
         [101,1,501,1],
         [102,2,502,2],
         [502,1,102,1]])

Thank you for the help!

Upvotes: 4

Views: 1930

Answers (3)

sabbahillel

Reputation: 4425

I do not understand what the second part of the question means (the part that starts with A[i,:]). However, as a simple loop you can also use:

B = []
for data in A:
    t = tuple(data)        # tuples support "in" tests; NumPy rows do not compare cleanly
    if t not in B:
        B.append(t)
B = np.array(B)

This loops through A, checks that each row is not already in B, and appends it if not. (The rows are converted to tuples first, since testing a NumPy row with "in" raises an error once B is non-empty.) This is not the most efficient way to do it, since you need to scan through B for every row of A, but it is simple and obvious.

Are you asking that if data = [0, 1, 2, 3], there should be no other row like [2, 1, 0, 3] (columns 0 and 2 swapped)? If that is the case, then add to the if:

and (data[2], data[1], data[0], data[3]) not in B
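Putting the two checks together, a minimal runnable sketch (rows are converted to tuples so the membership tests behave with NumPy arrays; A is copied from the question):

```python
import numpy as np

A = np.array([[100, 1, 500, 1],
              [100, 1, 501, 1],
              [101, 1, 501, 1],
              [102, 2, 502, 2],
              [500, 1, 100, 1],
              [100, 1, 500, 1],
              [502, 2, 102, 2],
              [502, 1, 102, 1]])

B = []
for data in A:
    t = tuple(data)
    swapped = (t[2], t[1], t[0], t[3])  # columns 0 and 2 exchanged
    if t not in B and swapped not in B:
        B.append(t)
B = np.array(B)
# B keeps rows 0, 1, 2, 3 and 7 of A, matching the B in the question
```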

I think that setting this up as tree-and-branch processing could also work. That is, at A[0] create branches by the value of A[1], then a branch for A[2], and a leaf for A[3]. At the final leaf, drop the row if the leaf already exists. Once the tree structure has been built, go back and collect all the branches and leaves into the array structure, and by definition they will be unique. This requires reading the initial list only once, and does not require scanning the whole B list for every row in A. I have not had time to figure out how to express this in Python rather than as a visual flow chart. Perhaps it can work as a set of nested dictionaries, but I am not sure.

{A[0]:{A[1]:{A[2]:{A[3]:True}}}}

would not work on its own, because a later row with the same keys would overwrite the value stored under them. Perhaps a list of dictionaries for each key and subkey would be possible: a set of dictionaries for the key A[1], and a list of dictionaries for the key A[2].

As each entry is read, if A[3] is not already found in the innermost collection, append it; if it is found, then the row is a duplicate. I think that the base structure would be something on the order of

{A[0]:({A[1]:({A[2]:(A[3])})})}

I do not know whether this concept would work, but it would require the appropriate for loops and appends, and might be too complex to set up.
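For what it is worth, the nested-dictionary idea can be sketched roughly as follows (a sketch, not a definitive implementation: swapping columns 0 and 2 into a canonical order is one way to handle the A[j, [2,1,0,3]] condition from the question):

```python
import numpy as np

A = np.array([[100, 1, 500, 1],
              [100, 1, 501, 1],
              [101, 1, 501, 1],
              [102, 2, 502, 2],
              [500, 1, 100, 1],
              [100, 1, 500, 1],
              [502, 2, 102, 2],
              [502, 1, 102, 1]])

tree = {}   # nested dicts: A[0] -> A[1] -> A[2] -> set of A[3] leaves
rows = []
for row in A:
    a0, a1, a2, a3 = (int(v) for v in row)
    if a0 > a2:
        a0, a2 = a2, a0     # canonical order, so swapped rows collide
    leaf = tree.setdefault(a0, {}).setdefault(a1, {}).setdefault(a2, set())
    if a3 not in leaf:      # first occurrence of this branch/leaf
        leaf.add(a3)
        rows.append(row)
B = np.array(rows)
# B keeps rows 0, 1, 2, 3 and 7, in original order
```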

Upvotes: 0

askewchan

Reputation: 46530

A[np.unique(np.sort(A,1).view("int, int, int, int"), return_index=True)[1]]

In steps:

In [385]: A
Out[385]: 
array([[100,   1, 500,   1],
       [100,   1, 501,   1],
       [101,   1, 501,   1],
       [102,   2, 502,   2],
       [500,   1, 100,   1],
       [100,   1, 500,   1],
       [502,   2, 102,   2],
       [502,   1, 102,   1]])

We can eliminate the need for swapping columns 0 and 2 (the condition A[i] == A[j, [2,1,0,3]]) simply by sorting each row. We don't have to worry about swapping columns 1 and 3, since for every row of this A, column 1 equals column 3: A[:, 1] == A[:, 3].

In [386]: As = np.sort(A,1)

In [387]: As
Out[387]: 
array([[  1,   1, 100, 500],
       [  1,   1, 100, 501],
       [  1,   1, 101, 501],
       [  2,   2, 102, 502],
       [  1,   1, 100, 500],
       [  1,   1, 100, 500],
       [  2,   2, 102, 502],
       [  1,   1, 102, 502]])

Find the unique rows in As (the sorted array). View it as a structured array where each row is a single element, since np.unique would otherwise flatten the array first:

In [388]: As.view('int, int, int, int')
Out[388]: 
array([[(1, 1, 100, 500)],
       [(1, 1, 100, 501)],
       [(1, 1, 101, 501)],
       [(2, 2, 102, 502)],
       [(1, 1, 100, 500)],
       [(1, 1, 100, 500)],
       [(2, 2, 102, 502)],
       [(1, 1, 102, 502)]], 
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8')])

In [389]: u, i = np.unique(As.view('int, int, int, int'), return_index=True)

In [390]: i
Out[390]: array([0, 1, 2, 7, 3])

And use them to get the rows that were unique in As from the original array A:

In [391]: A[i]
Out[391]: 
array([[100,   1, 500,   1],
       [100,   1, 501,   1],
       [101,   1, 501,   1],
       [502,   1, 102,   1],
       [102,   2, 502,   2]])
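One caveat: np.unique returns its results in sorted order, so with i = [0, 1, 2, 7, 3] the rows of A[i] come out in sorted-unique order, not in order of first appearance. If B should preserve the original row order (as in the question's expected output), sorting the indices first fixes this. A small sketch, building the four-field view from A's own dtype so it works whatever the platform's integer size is:

```python
import numpy as np

A = np.array([[100, 1, 500, 1],
              [100, 1, 501, 1],
              [101, 1, 501, 1],
              [102, 2, 502, 2],
              [500, 1, 100, 1],
              [100, 1, 500, 1],
              [502, 2, 102, 2],
              [502, 1, 102, 1]])

As = np.sort(A, 1)
# one structured element per row; field dtypes taken from A itself
u, i = np.unique(As.view([('', A.dtype)] * 4), return_index=True)
B = A[np.sort(i)]   # first-appearance order: rows 0, 1, 2, 3, 7
```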

Upvotes: 3

Javier

Reputation: 742

For the unique rows, you can use Python sets to remove duplicates. Since NumPy arrays (and lists) are unhashable, you have to convert the rows to tuples while building the set, and then convert back to an array:

B = np.array(list(set([tuple(x) for x in A])))
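Note that a set is unordered, so the rows of B may come back in arbitrary order. If first-occurrence order matters, one option (a sketch) is to keep an auxiliary seen-set alongside an ordered list:

```python
import numpy as np

A = np.array([[100, 1, 500, 1],
              [100, 1, 501, 1],
              [101, 1, 501, 1],
              [102, 2, 502, 2],
              [500, 1, 100, 1],
              [100, 1, 500, 1],
              [502, 2, 102, 2],
              [502, 1, 102, 1]])

seen = set()
rows = []
for x in A:
    t = tuple(x)
    if t not in seen:   # O(1) membership test, unlike scanning a list
        seen.add(t)
        rows.append(t)
B = np.array(rows)
# only row 5 (an exact duplicate of row 0) is dropped here
```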

For the second part of your question, you need to implement your own selection loop:

B = []
for row in A:
    lrow = list(row)
    if lrow not in B and [lrow[2], lrow[1], lrow[0], lrow[3]] not in B:
        B.append(lrow)
B = np.array(B)

Upvotes: 0
