mihalios
mihalios

Reputation: 828

Python: Counting identical rows in an array (without any imports)

For example, given:

import numpy as np
data = np.array(
    [[0, 0, 0],
    [0, 1, 1],
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 1],
    [0, 0, 0]])

I want to get a 3-dimensional array, looking like:

result = array([[[ 2.,  0.],
                 [ 0.,  2.]],

                [[ 0.,  2.],
                 [ 0.,  0.]]])

One way is:

for row in data
    newArray[ row[0] ][ row[1] ][ row[2] ] += 1

What I'm trying to do is the following:

for i in dimension1
   for j in dimension2
      for k in dimension3
          result[i,j,k] = (data[data[data[:,0]==i, 1]==j, 2]==k).sum()

This doesn't seem to work and I would like to achieve the desired result by sticking to my implementation rather than the one mentioned in the beginning (or using any extra imports, eg counter).

Thanks.

Upvotes: 5

Views: 183

Answers (4)

eric chiang
eric chiang

Reputation: 2755

Don't fear the imports. They're what make Python awesome.

If question assumes that you already have the result matrix.

import numpy as np
data = np.array(
    [[0, 0, 0],
     [0, 1, 1],
     [1, 0, 1],
     [1, 0, 1],
     [0, 1, 1],
     [0, 0, 0]]
)
result = np.zeros((2,2,2))

# range of each dim, aka allowable values for each dim
dim_ranges = zip(np.zeros(result.ndim), np.array(result.shape)-1)
dim_ranges
# Out[]:
#     [(0.0, 2), (0.0, 2), (0.0, 2)]

# Multidimentional histogram will effectively "count" along each dim
sums,_ = np.histogramdd(data,bins=result.shape,range=dim_ranges)
result += sums
result
# Out[]:
#     array([[[ 2.,  0.],
#             [ 0.,  2.]],
#
#            [[ 0.,  2.],
#             [ 0.,  0.]]])

This solution solves for any "result" ndarray, no matter what the shape. Additionally, it works fine even if your "data" ndarray has indices which are out-of-bounds for your result matrix.

Upvotes: 1

tobias_k
tobias_k

Reputation: 82899

The problem is that data[data[data[:,0]==i, 1]==j, 2]==k is not what you expect it to be.

Let's take this apart for the case (i,j,k) == (0,0,0)

data[:,0]==0 is [True, True, False, False, True, True], and data[data[:,0]==0] correctly gives us the lines where the first number is 0.

Now from those lines we get the lines where the second number is 0: data[data[:,0]==0, 1]==0, which gives us [True, False, False, True]. And this is the problem. Because if we take those indices from data, i.e., data[data[data[:,0]==0, 1]==0] we do not get the rows where the first and second number are 0, but the 0th and 3rd row instead:

In [51]: data[data[data[:,0]==0, 1]==0]
Out[51]: array([[0, 0, 0],
                [1, 0, 1]])

And if we now filter for the rows where the third number is 0, we get the wrong result w.r.t. the orignal data.

And that's why your approach does not work. For better methods, see the other answers.

Upvotes: 2

Ashwini Chaudhary
Ashwini Chaudhary

Reputation: 250961

You can also use numpy.histogramdd for this:

>>> np.histogramdd(data, bins=(2, 2, 2))[0]
array([[[ 2.,  0.],
        [ 0.,  2.]],

       [[ 0.,  2.],
        [ 0.,  0.]]])

Upvotes: 4

Daniel
Daniel

Reputation: 19547

You can do something like the following

#Get output dimension and construct output array.
>>> dshape = tuple(data.max(axis=0)+1)
>>> dshape
(2, 2, 2)
>>> out = np.zeros(shape)

If you have numpy 1.8+:

out.flat[np.ravel_multi_index(data.T, dshape)]+=1

Else:

#Get indices and unique the resulting array
>>> inds = np.ravel_multi_index(data.T, dshape)
>>> inds, inverse = np.unique(inds, return_inverse=True)
>>> values = np.bincount(inverse)

>>> values
array([2, 2, 2])

>>> out.flat[inds] = values
>>> out
array([[[ 2.,  0.],
        [ 0.,  2.]],

       [[ 0.,  2.],
        [ 0.,  0.]]])

Numpy versions before numpy 1.7 do not have a add.at attribute and the top code will not work without it. As ravel_multi_index may not be the fastest algorithm ever you can look into taking the unique rows of a numpy array. In effect these two operations should be equivalent.

Upvotes: 2

Related Questions