Convert 2d-array to 2d-array of unique values per row

I have a 2d-array of shape 5x4 like this:

array([[3, 3, 3, 3],
       [3, 3, 3, 3],
       [3, 3, 2, 2],
       [2, 2, 2, 2],
       [2, 2, 2, 2]])

And I'd like to obtain another array that contains arrays of unique values, something like this:

array([array([3]), array([3]), array([2, 3]), array([2]), array([2])],
      dtype=object)

I obtained that with the following code:

np.array([np.unique(row) for row in matrix], dtype=object)

However, this is not vectorized. How could I achieve the same in a vectorized numpy operation?

Upvotes: 0

Views: 142

Answers (2)

Divakar

Reputation: 221614

Here's one way to minimize the work done while iterating, which should help performance:

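# a is the input 2D array
b = np.sort(a, axis=1)                        # sort each row
o = np.ones((len(a), 1), dtype=bool)          # the first value in a row is always kept
mask = np.c_[o, b[:, :-1] != b[:, 1:]]        # True where a value differs from its left neighbour
c = b[mask]                                   # unique values of all rows, flattened
out = np.split(c, mask.sum(1).cumsum())[:-1]  # cut at row boundaries; the last piece is empty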

A loop that uses plain slicing could be faster than np.split, since each iteration then does nothing but slice. The last step could be replaced with something like this:

idx = np.r_[0,mask.sum(1).cumsum()]
out = []
for (i,j) in zip(idx[:-1],idx[1:]):
    out.append(c[i:j])
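
For reference, the steps above can be put together into a self-contained sketch. The function name unique_per_row is my own choice for illustration, not part of the answer, and the result is checked against the list comprehension from the question:

```python
import numpy as np

def unique_per_row(a):
    """Return a list holding the sorted unique values of each row of a 2D array."""
    b = np.sort(a, axis=1)
    # The first value in a row is always new; elsewhere compare with the left neighbour.
    first = np.ones((b.shape[0], 1), dtype=bool)
    mask = np.concatenate([first, b[:, 1:] != b[:, :-1]], axis=1)
    c = b[mask]  # unique values of all rows, flattened row by row
    # Start/stop offsets of each row inside the flat array c.
    idx = np.concatenate([[0], mask.sum(axis=1).cumsum()])
    return [c[i:j] for i, j in zip(idx[:-1], idx[1:])]

a = np.array([[3, 3, 3, 3],
              [3, 3, 3, 3],
              [3, 3, 2, 2],
              [2, 2, 2, 2],
              [2, 2, 2, 2]])

out = unique_per_row(a)
expected = [np.unique(row) for row in a]
assert all(np.array_equal(x, y) for x, y in zip(out, expected))
```

The result is a plain Python list of 1D arrays, which sidesteps the object-dtype array entirely.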

Upvotes: 1

Paddy Harrison

Reputation: 2002

NumPy arrays must have a regular shape, so if some rows have only one unique value and others have two or more, the result can't be stored as an ordinary 2D array. A workaround is to pad the array with a known value, e.g. np.nan.
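
A minimal sketch of the padding idea, assuming numeric input with at least one row. The function name unique_per_row_padded and its fill parameter are hypothetical, and the output is converted to float so that NaN can be stored:

```python
import numpy as np

def unique_per_row_padded(a, fill=np.nan):
    """Sorted unique values per row, right-padded with `fill` to a rectangle."""
    b = np.sort(a, axis=1)
    # Mark the first occurrence of each value within its (sorted) row.
    first = np.ones((b.shape[0], 1), dtype=bool)
    mask = np.concatenate([first, b[:, 1:] != b[:, :-1]], axis=1)
    counts = mask.sum(axis=1)  # number of unique values in each row
    width = counts.max()
    # Float output so that NaN is representable.
    out = np.full((b.shape[0], width), fill, dtype=float)
    # Slot k of row i is filled when k < counts[i]; both sides are in row-major order.
    out[np.arange(width) < counts[:, None]] = b[mask]
    return out

arr = np.array([[3, 3, 3, 3],
                [3, 3, 3, 3],
                [3, 3, 2, 2],
                [2, 2, 2, 2],
                [2, 2, 2, 2]])

padded = unique_per_row_padded(arr)
```

Here padded has shape (5, 2): the row with two distinct values fills both slots, and every other row holds its single value followed by NaN.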

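For this data, np.unique gets you most of the way there if you use its axis argument. With axis=1 it removes duplicate columns (it compares whole columns rather than treating each row independently), so every row keeps all of its distinct values, possibly with repeats: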

arr = np.array([[3, 3, 3, 3],
                [3, 3, 3, 3],
                [3, 3, 2, 2],
                [2, 2, 2, 2],
                [2, 2, 2, 2]])

np.unique(arr, axis=1)
# array([[3, 3],
#        [3, 3],
#        [2, 3],
#        [2, 2],
#        [2, 2]])

The result is a regular array in which each row contains all of its unique values. Some values are repeated, but that is the price of keeping a rectangular array.
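
One caveat worth checking: axis=1 compares whole columns, so a value repeated within a row survives whenever the columns it sits in differ elsewhere. The array x below is a made-up input for illustration, not taken from the question:

```python
import numpy as np

x = np.array([[1, 2, 1],
              [3, 3, 4]])

cols = np.unique(x, axis=1)

# No two columns of x are identical, so nothing is removed.
assert cols.shape == x.shape

# Each row still holds the same set of values as before ...
assert all(set(r) == set(o) for r, o in zip(cols, x))

# ... but row 0 keeps three entries although it has only two distinct values.
assert len(np.unique(x[0])) == 2 and cols.shape[1] == 3
```

So the output is only as narrow as the number of distinct columns, which for general data can be much wider than the largest per-row unique count.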

Upvotes: 1
