Reputation: 7799
If I have a numpy array like this....
import numpy as np
a = np.array([
[0, 0],
[0, 1],
[1, 0],
[1, 1],
])
How would I find the indices of the rows where the values in one or more specified columns are unique? What I mean is... if I specify a column as a "mask", how would I find the unique rows using that column as the mask? For example, if I wanted...
Unique rows with respect to column 0 (column 0 is the mask). I would want a return like this....
[[0,1],[2,3]]
because if you were to use column 0 as the criterion for uniqueness, rows 0 and 1 would be in one "unique group" and rows 2 and 3 would be in another, because they have the same value in column 0.
If I wanted the unique rows with respect to column 1 (column 1 is now the mask), I would like an output like this....
[[0,2],[1,3]]
because using column 1 as the criterion for uniqueness would put rows 0 and 2 in one group and rows 1 and 3 in another, since they have the same values in column 1.
I also want to be able to get the unique rows with respect to more than one column. So if I wanted the unique rows with respect to columns 0 AND 1 (now both columns are the mask), I would want this return....
[[0],[1],[2],[3]]
because when you use both columns as your uniqueness criteria there are four unique rows.
Is there an easy way to do this in numpy? Thanks.
Upvotes: 4
Views: 1016
Reputation: 10759
The numpy_indexed package (disclaimer: I am its author) provides a fully vectorized solution to these kinds of problems:
import numpy_indexed as npi
# entire rows of a determine uniqueness
npi.unique(a)
# only second column determines uniqueness
npi.unique(a[:, 1])
Other column selections are possible as well.
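For the index groups the question asks for, the same package has a group_by object whose split method partitions an array by key; a minimal sketch, assuming the group_by/split API works as documented:
import numpy as np
import numpy_indexed as npi
# group the row indices 0..3 by the values in column 0
print(npi.group_by(a[:, 0]).split(np.arange(len(a))))
# expected: [array([0, 1]), array([2, 3])]
# use the full rows as the key for multi-column uniqueness
print(npi.group_by(a).split(np.arange(len(a))))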
Upvotes: 0
Reputation: 5261
Try using itertools.groupby
from itertools import groupby
data = [1,3,2,3,4,1,5,2,6,3,4]
# pair each value with its original index
data = [(x, k) for k, x in enumerate(data)]
# groupby needs its input sorted by the key
data = sorted(data)
groups = []
for k, g in groupby(data, lambda x: x[0]):
    groups.append([x[1] for x in g])
print(groups)
Output is
[[0, 5], [2, 7], [1, 3, 9], [4, 10], [6], [8]]
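Applied to the question's array, the same idea works if you sort the row indices by the chosen key column before grouping; a minimal sketch, with column 0 as the key:
import numpy as np
from itertools import groupby

a = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
col = 0  # key column, change as needed
order = sorted(range(len(a)), key=lambda i: a[i, col])
groups = [list(g) for _, g in groupby(order, key=lambda i: a[i, col])]
print(groups)  # [[0, 1], [2, 3]]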
Upvotes: 1
Reputation: 9770
Here's a custom solution that is certainly not going to be very performant since it does a lot of copying and directly iterates over the matrix:
from collections import defaultdict

def groupby(a, key_columns):
    groups = defaultdict(list)
    for i, row in enumerate(a):
        # the tuple of values in the key columns identifies the group
        groups[tuple(row[c] for c in key_columns)].append(i)
    return list(groups.values())
This assumes key_columns is a list or tuple containing the columns you want to group over. You could also do some argument inspection and promote a single index into a singleton list.
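That promotion could be as simple as this sketch, placed at the top of the function (the isinstance check is just one way to do it):
if isinstance(key_columns, int):  # a bare column index was passed
    key_columns = [key_columns]   # promote it to a singleton list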
Running the following examples yields this output:
>>> groupby(a, [0])
[[0, 1], [2, 3]]
>>> groupby(a, [1])
[[0, 2], [1, 3]]
It also works for multiple key columns like you asked:
>>> groupby(a, [0, 1])
[[1], [2], [0], [3]]
Note that on Python versions before 3.7, where dicts (and so defaultdict) do not preserve insertion order, the order of the groups is not guaranteed. You could either sort the resulting values or use a collections.OrderedDict instead, depending on how you plan to use the secondary indexes.
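For instance, one way to get a deterministic order is to sort the groups by their smallest row index:
groups = sorted(groupby(a, [0, 1]), key=min)
# [[0], [1], [2], [3]]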
Upvotes: 1
Reputation: 27575
A possible way, using a loop:
import numpy
a = numpy.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1],
])
# unique values of the zero-th column, change as needed:
un = numpy.unique(a[:, 0])
results = []
# could be a list comprehension
for val in un:
    indices = a[:, 0] == val
    results.append(numpy.argwhere(indices).flatten())
result = numpy.array(results)
print(result)
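For the multi-column case, the loop can also be avoided entirely with numpy.unique's axis and return_inverse options (available since NumPy 1.13); a minimal sketch:
import numpy as np
a = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
# group id of each row with respect to both columns
_, inverse = np.unique(a[:, [0, 1]], axis=0, return_inverse=True)
groups = [np.flatnonzero(inverse == g) for g in range(inverse.max() + 1)]
print(groups)  # [array([0]), array([1]), array([2]), array([3])]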
Depending on your needs and ultimate goals, you could use the Pandas library. It has a groupby method you could use like this:
import pandas
import numpy as np
a = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1],
])
df = pandas.DataFrame(a).groupby([0])  # zero-th column, change as needed
for key, group in df:
    print(group.values)
Notice that this returns the actual values, not the indices.
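If you want the row indices instead, the groupby object exposes them directly; for example (the exact key format can vary between pandas versions):
# dict mapping each group key to the positional row indices
print(df.indices)  # e.g. {0: array([0, 1]), 1: array([2, 3])}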
Upvotes: 0