b10hazard

Reputation: 7799

How do I get the indexes of unique rows for a specified column in a two-dimensional array?

If I have a numpy array like this....

import numpy as np

a = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1],
])

How would I find the indices of the rows grouped by the values in one or more specified columns? What I mean is: if I specify a column as a "mask", how would I find the groups of unique rows using that column as the mask? For example, if I wanted...

Unique rows with respect to column 0 (column 0 is the mask). I would want a return like this....

[[0,1],[2,3]]

because if you were to use column 0 as the criterion for uniqueness, rows 0 and 1 would be in the same "unique group" and rows 2 and 3 would be in another "unique group", because they have the same value in column 0.

If I wanted the unique rows with respect to column 1 (column 1 is now the mask), I would like an output like this....

[[0,2],[1,3]]

because using column 1 as the criterion for uniqueness would put rows 0 and 2 in one unique group and rows 1 and 3 in another, since each pair has the same value in column 1.

I also want to be able to get the unique rows with respect to more than one column. So if I wanted the unique rows with respect to columns 0 AND 1 (now both columns 0 and 1 are the mask), I would want this return....

[[0],[1],[2],[3]]

because when you use both columns as your uniqueness criteria there are four unique rows.

Is there an easy way to do this in numpy? Thanks.

Upvotes: 4

Views: 1016

Answers (4)

Eelco Hoogendoorn

Reputation: 10759

The numpy_indexed package (disclaimer: I am its author) provides a fully vectorized solution to this kind of problem:

import numpy_indexed as npi
# entire rows of a determine uniqueness
npi.unique(a)
# only second column determines uniqueness
npi.unique(a[:, 1])

Other columns, or combinations of columns, can be used as the key in the same way.
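
If adding a dependency is not an option, a similar grouping of row indices can be sketched with plain NumPy using np.unique and its return_inverse flag (this is an alternative sketch, not part of numpy_indexed):

import numpy as np

a = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# label each row by which unique value of column 0 it carries
_, inverse = np.unique(a[:, 0], return_inverse=True)

# collect the row indices belonging to each label
groups = [np.flatnonzero(inverse == g) for g in range(inverse.max() + 1)]
print(groups)  # [array([0, 1]), array([2, 3])]

For more than one key column, np.unique with axis=0 (available in NumPy 1.13+) can build the inverse array in the same way.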

Upvotes: 0

saulspatz

Reputation: 5261

Try using itertools.groupby:

from itertools import groupby

data = [1,3,2,3,4,1,5,2,6,3,4]
data = [(x, k) for k, x in enumerate(data)]
data = sorted(data)

groups = []
for k, g in groupby(data, lambda x:x[0]):
    groups.append([x[1] for x in g])

print(groups)

Output is

[[0, 5], [2, 7], [1, 3, 9], [4, 10], [6], [8]]
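
The same pattern can be applied to the array from the question by grouping row indices on the chosen key column(s). A minimal sketch, assuming the array a from the question and key column 0:

from itertools import groupby

import numpy as np

a = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
key_cols = [0]  # columns that define uniqueness; change as needed

def row_key(i):
    # key of row i: the values in the chosen columns
    return tuple(a[i, key_cols])

# sort row indices by their key, then group consecutive equal keys
order = sorted(range(len(a)), key=row_key)
groups = [list(g) for _, g in groupby(order, key=row_key)]

print(groups)  # [[0, 1], [2, 3]]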

Upvotes: 1

b4hand

Reputation: 9770

Here's a custom solution that is certainly not going to be very performant since it does a lot of copying and directly iterates over the matrix:

def groupby(a, key_columns):
    from collections import defaultdict
    groups = defaultdict(list)
    for i, row in enumerate(a):
        groups[tuple(row[c] for c in key_columns)].append(i)
    return groups.values()

This assumes key_columns is a list or tuple containing the columns you want to group over. You could also do some argument inspection and promote a single index into a singleton list.
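
As a sketch of that argument inspection (a hypothetical tweak, not part of the answer above), a bare integer could be promoted at the top of the function; lists and tuples still work exactly as before:

def groupby(a, key_columns):
    from collections import defaultdict
    # promote a single column index into a singleton list
    if isinstance(key_columns, int):
        key_columns = [key_columns]
    groups = defaultdict(list)
    for i, row in enumerate(a):
        groups[tuple(row[c] for c in key_columns)].append(i)
    return groups.values()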

Running the following examples yields this output:

>>> groupby(a, [0])
[[0, 1], [2, 3]]
>>> groupby(a, [1])
[[0, 2], [1, 3]]

It also works for multiple key columns like you asked:

>>> groupby(a, [0, 1])
[[1], [2], [0], [3]]

Note in this case, since a defaultdict is used, the order of the values is not guaranteed. You could either sort the resulting values or use a collections.OrderedDict instead depending on how you plan to use the secondary indexes.
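
On Python 3.7+ plain dicts (and defaultdict) already preserve insertion order, but for an explicitly deterministic ordering the groups can simply be sorted by their first row index (a minimal sketch, reusing a and the groupby function from above):

result = sorted(groupby(a, [0, 1]), key=lambda idxs: idxs[0])
print(result)  # [[0], [1], [2], [3]]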

Upvotes: 1

heltonbiker

Reputation: 27575

A possible way, using a loop:

import numpy

a = numpy.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1],
])


un = numpy.unique(a[:, 0])  # unique values in the key column

results = []

# could be a list comprehension
for val in un:
    # zero-th column, change as needed:
    indices = a[:, 0] == val
    results.append(numpy.argwhere(indices).flatten())

result = numpy.array(results)

print(result)
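
The loop body can also be written as the list comprehension the comment alludes to (a minimal sketch, reusing a and un from above):

results = [numpy.argwhere(a[:, 0] == val).flatten() for val in un]
print(results)  # [array([0, 1]), array([2, 3])]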

Depending on your needs and ultimate goals, you could use the Pandas library.

It has a groupby method you could use like this:

import pandas
import numpy as np

a = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1],
])


df = pandas.DataFrame(a).groupby([0])  # zero-th column, change as needed

for key, group in df:
    print(group.values)

Notice that this returns the actual values, not the indices.
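
If the row indices themselves are what you are after, the groupby object also exposes them through its indices attribute (a minimal sketch; the exact repr of the printed dict may vary between pandas versions):

import pandas
import numpy as np

a = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

grouped = pandas.DataFrame(a).groupby([0])

# maps each key value to the integer positions of its rows,
# e.g. {0: array([0, 1]), 1: array([2, 3])}
print(grouped.indices)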

Upvotes: 0
