Ch3steR

Reputation: 20689

Finding unique values in each row

I have an array of two-character strings and want to get the unique strings in each row.

import numpy as np

np.__version__
# '1.19.2'
arr = np.array([['Z7', 'Q4', 'Q4'], # 2 unique strings
                ['Q4', 'Z7', 'Q4'], # 2 unique strings
                ['Q4', 'Z7', 'Z7'], # 2 unique strings
                ['Z7', 'Z7', 'Q4'], # 2 unique strings
                ['D8', 'D8', 'L1'], # 2 unique strings
                ['L1', 'L1', 'D8']], dtype='<U2') # 2 unique strings

Every row is guaranteed to contain the same number of unique strings; in my case that number is 2.

Expected output:

array([['Q4', 'Z7'],
       ['Q4', 'Z7'],
       ['Q4', 'Z7'],
       ['Q4', 'Z7'],
       ['D8', 'L1'],
       ['D8', 'L1']], dtype='<U2')

Here each row is sorted, but it doesn't have to be. Either order is fine.

My code:

np.apply_along_axis(np.unique, 1, arr)

# array([['Q4', 'Z7'],
#        ['Q4', 'Z7'],
#        ['Q4', 'Z7'],
#        ['Q4', 'Z7'],
#        ['D8', 'L1'],
#        ['D8', 'L1']], dtype='<U2')

I thought np.unique over axis 1 would give the expected result, but it does not:

np.unique(arr, axis=1)
# array([['Q4', 'Q4', 'Z7'],
#        ['Q4', 'Z7', 'Q4'],
#        ['Z7', 'Z7', 'Q4'],
#        ['Q4', 'Z7', 'Z7'],
#        ['L1', 'D8', 'D8'],
#        ['D8', 'L1', 'L1']], dtype='<U2')

I can't work out what exactly happened and why it returned this particular output.

Upvotes: 1

Views: 1010

Answers (2)

Valdi_Bo

Reputation: 31011

The documentation of np.unique, in the description of the axis parameter, contains the following statement:

... the subarrays indexed by the given axis will be flattened and treated as the elements of a 1-D array ...

So if you call np.unique, passing axis=1, then:

  • Each column is flattened (each column already holds "atomic" values, so nothing changes).
  • Unique elements are then searched for in the resulting list of columns. If two columns were identical, only one of them would be retained.
  • The result may come back in a different order, because np.unique returns its unique elements sorted; here the elements are whole columns.

A short explanation of why each column (not each row) is processed: with axis=1, the subarrays indexed along axis 1 are the columns.
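The steps above can be reproduced by hand as a rough check. This is only a sketch: the names `columns`, `manual` and `result` are illustrative, and it assumes that sorting the columns as Python tuples orders them the same way np.unique does for this string array.

```python
import numpy as np

arr = np.array([['Z7', 'Q4', 'Q4'],
                ['Q4', 'Z7', 'Q4'],
                ['Q4', 'Z7', 'Z7'],
                ['Z7', 'Z7', 'Q4'],
                ['D8', 'D8', 'L1'],
                ['L1', 'L1', 'D8']], dtype='<U2')

# arr.T iterates over the columns of arr. Turn each column into a tuple,
# drop duplicate columns with set(), and sort the remaining columns.
columns = sorted(set(map(tuple, arr.T)))

# Put the sorted columns back side by side.
manual = np.array(columns).T

result = np.unique(arr, axis=1)

# Compare the hand-made version with what np.unique returns.
print(np.array_equal(result, manual))
```

All three columns of arr differ, so nothing is dropped and the only visible change is the column order.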

To confirm that each column is the processed object in this case, define the source array as:

arr_2 = np.array([['Z7', 'Q4', 'Q4', 'Q4'],
                  ['Q4', 'Z7', 'Q4', 'Q4'],
                  ['Q4', 'Z7', 'Z7', 'Z7'],
                  ['Z7', 'Z7', 'Q4', 'Q4'],
                  ['D8', 'D8', 'L1', 'L1'],
                  ['L1', 'L1', 'D8', 'D8']])

where the last 2 columns are identical.

When you execute np.unique(arr_2, axis=1), the result is exactly the same as for arr. The last two columns were identical, so one of them has been eliminated.
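A quick way to check this claim, as a sketch: `deduped` and `first_three` are illustrative names, and `arr_2[:, :3]` is used as a stand-in for the original arr, since the first three columns of arr_2 are the same as arr.

```python
import numpy as np

arr_2 = np.array([['Z7', 'Q4', 'Q4', 'Q4'],
                  ['Q4', 'Z7', 'Q4', 'Q4'],
                  ['Q4', 'Z7', 'Z7', 'Z7'],
                  ['Z7', 'Z7', 'Q4', 'Q4'],
                  ['D8', 'D8', 'L1', 'L1'],
                  ['L1', 'L1', 'D8', 'D8']])

# Unique columns of the 4-column array.
deduped = np.unique(arr_2, axis=1)

# Unique columns of the first three columns only (the original arr).
first_three = np.unique(arr_2[:, :3], axis=1)

print(arr_2.shape, deduped.shape)            # one duplicated column is dropped
print(np.array_equal(deduped, first_three))  # same result as for arr
```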

Upvotes: 1

Stefan

Reputation: 957

That is because numpy.unique with an axis argument flattens the row or column subarrays and then returns the unique rows (axis=0) or unique columns (axis=1), instead of the unique values themselves. Take a look at this example:

a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
np.unique(a, axis=0)

The output is:

array([[1, 0, 0], [2, 3, 4]])

and

b = np.array([[1, 1, 0], [1, 1, 0], [2, 2, 4]])
np.unique(b, axis=1)

The output is:

array([[0, 1],
       [0, 1],
       [4, 2]])

In your case you want the unique values within each row, so you should use np.apply_along_axis as you have already done. Passing axis=1 changes little here because your columns are all distinct, so the only visible effect is that they come back sorted.
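If the per-row Python call in np.apply_along_axis becomes slow on large arrays, one possible alternative is to sort each row and keep the first element of every run of equal values. This is a sketch that is not taken from either answer: it relies on the question's guarantee that every row has the same number of unique strings (otherwise the final reshape fails), and the names `sorted_rows`, `first_of_run` and `out` are illustrative.

```python
import numpy as np

arr = np.array([['Z7', 'Q4', 'Q4'],
                ['Q4', 'Z7', 'Q4'],
                ['Q4', 'Z7', 'Z7'],
                ['Z7', 'Z7', 'Q4'],
                ['D8', 'D8', 'L1'],
                ['L1', 'L1', 'D8']], dtype='<U2')

# Sort within each row so that equal strings become neighbours.
sorted_rows = np.sort(arr, axis=1)

# Mark the first element of every run of equal values in each row.
first_of_run = np.ones(sorted_rows.shape, dtype=bool)
first_of_run[:, 1:] = sorted_rows[:, 1:] != sorted_rows[:, :-1]

# Boolean indexing flattens row by row; the reshape is only valid because
# every row is assumed to contain the same number of unique strings.
out = sorted_rows[first_of_run].reshape(arr.shape[0], -1)
print(out)
```

Each row of out is sorted, which the question says is acceptable.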

Upvotes: 2
