Reputation: 75
I have a NumPy structured array that is sorted by the first column:
x = array([(2, 3), (2, 8), (4, 1)], dtype=[('recod', '<u8'), ('count', '<u4')])
I need to merge records (sum the values of the second column) where
x[n][0] == x[n + 1][0]
In this case, the desired output would be:
x = array([(2, 11), (4, 1)], dtype=[('recod', '<u8'), ('count', '<u4')])
What's the best way to achieve this?
Upvotes: 3
Views: 443
Reputation: 231540
Divakar's answer cast in structured array form:
In [500]: x=np.array([(25, 1), (37, 3), (37, 2), (47, 1), (59, 2)], dtype=[('recod', '<u8'), ('count', '<u4')])
Find the unique values and sum the counts of the duplicates:
In [501]: unqA, idx=np.unique(x['recod'], return_inverse=True)
In [502]: cnt = np.bincount(idx, x['count'])
Make a new structured array and fill the fields:
In [503]: x1 = np.empty(unqA.shape, dtype=x.dtype)
In [504]: x1['recod'] = unqA
In [505]: x1['count'] = cnt
In [506]: x1
Out[506]:
array([(25, 1), (37, 5), (47, 1), (59, 2)],
dtype=[('recod', '<u8'), ('count', '<u4')])
There is a recarray function that builds an array from a list of arrays:
In [507]: np.rec.fromarrays([unqA,cnt],dtype=x.dtype)
Out[507]:
rec.array([(25, 1), (37, 5), (47, 1), (59, 2)],
dtype=[('recod', '<u8'), ('count', '<u4')])
Internally it does the same thing: build an empty array of the right size and dtype, then loop over the dtype fields. A recarray is just a structured array in a specialized array subclass wrapper.
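A minimal sketch of that internal loop (simplified; the real np.rec.fromarrays does more validation than this):
import numpy as np

arrays = [np.array([25, 37, 47, 59], dtype='<u8'),
          np.array([1, 5, 1, 2], dtype='<u4')]
dt = np.dtype([('recod', '<u8'), ('count', '<u4')])

# Build an empty array of the right size and dtype,
# then fill one field per input array:
out = np.empty(arrays[0].shape, dtype=dt)
for name, arr in zip(dt.names, arrays):
    out[name] = arr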
There are two ways of populating a structured array (especially with a diverse dtype): with a list of tuples, as you did with x, and field by field.
Upvotes: 2
Reputation: 114911
pandas makes this type of "group-by" operation trivial:
In [285]: import pandas as pd
In [286]: x = [(25, 1), (37, 3), (37, 2), (47, 1), (59, 2)]
In [287]: df = pd.DataFrame(x)
In [288]: df
Out[288]:
0 1
0 25 1
1 37 3
2 37 2
3 47 1
4 59 2
In [289]: df.groupby(0).sum()
Out[289]:
1
0
25 1
37 5
47 1
59 2
You probably won't want the dependency on pandas if this is the only operation you need from it, but once you get started, you might find other useful bits in the library.
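If you do want the result back as a NumPy structured array rather than a DataFrame, a minimal sketch (the conversion step is an addition, not part of the original answer; the dtype comes from the question):
import numpy as np
import pandas as pd

x = [(25, 1), (37, 3), (37, 2), (47, 1), (59, 2)]
grouped = pd.DataFrame(x).groupby(0, as_index=False).sum()
# Rebuild a structured array with the question's dtype
# (assumed conversion; .values gives a plain 2-D array):
out = np.array([tuple(row) for row in grouped.values],
               dtype=[('recod', '<u8'), ('count', '<u4')])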
Upvotes: 2
Reputation: 221624
You can use np.unique to get an ID array for each element in the first column and then use np.bincount to perform accumulation on the second-column elements based on those IDs -
In [140]: A
Out[140]:
array([[25, 1],
[37, 3],
[37, 2],
[47, 1],
[59, 2]])
In [141]: unqA,idx = np.unique(A[:,0],return_inverse=True)
In [142]: np.column_stack((unqA,np.bincount(idx,A[:,1])))
Out[142]:
array([[ 25., 1.],
[ 37., 5.],
[ 47., 1.],
[ 59., 2.]])
You can avoid np.unique with a combination of np.diff and np.cumsum, which might help because np.unique also sorts internally; that sorting is not needed here, as the input data is already sorted. The implementation would look something like this -
In [201]: A
Out[201]:
array([[25, 1],
[37, 3],
[37, 2],
[47, 1],
[59, 2]])
In [202]: unq1 = np.append(True,np.diff(A[:,0])!=0)
In [203]: np.column_stack((A[:,0][unq1],np.bincount(unq1.cumsum()-1,A[:,1])))
Out[203]:
array([[ 25., 1.],
[ 37., 5.],
[ 47., 1.],
[ 59., 2.]])
Upvotes: 3
Reputation: 77991
You can use np.add.reduceat. You just need to find the indices where x[:, 0] changes; these are the nonzero indices of np.diff(x[:, 0]) shifted by one, plus the initial index 0:
>>> i = np.r_[0, 1 + np.nonzero(np.diff(x[:,0]))[0]]
>>> a, b = x[i, 0], np.add.reduceat(x[:, 1], i)
>>> np.vstack((a, b)).T
array([[25, 1],
[37, 5],
[47, 1],
[59, 2]])
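Applied directly to the structured array from the question, a sketch along the same lines (the field names 'recod' and 'count' are taken from the question; this adaptation is not part of the original answer):
import numpy as np

x = np.array([(25, 1), (37, 3), (37, 2), (47, 1), (59, 2)],
             dtype=[('recod', '<u8'), ('count', '<u4')])
# Start index of each run of equal 'recod' values:
i = np.r_[0, 1 + np.nonzero(np.diff(x['recod']))[0]]
out = np.empty(len(i), dtype=x.dtype)
out['recod'] = x['recod'][i]
out['count'] = np.add.reduceat(x['count'], i)  # sum counts per run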
Upvotes: 1