Reputation: 23
I have a long list of index tuples (lots of duplicates), and a matrix of n by n indices. Each tuple represents a co-occurrence.
For instance:
a = np.zeros(shape=(indexCount,indexCount))
I have tried this:
for i1,i2 in coocPairs: #for instance (2374, 22003)
a[i1][i2}+=1 #takes way too long
or:
np.put(a,coocPairs,1) #which obviously does not increment
or also:
np.add(a,coocPairs,1) #which takes even longer.
In an ideal world, there would be a function that takes my list of tuples, and builds a cooccurrence matrix with it, but alas (doc. isnt very helpful). I think the solution could be more on the python side of the code, but I run out of ideas. Any help welcome. Thank you for your time,
Upvotes: 2
Views: 1387
Reputation: 114230
You can use a collections.Counter
to get the values that will actually appear in the matrix. This works because tuples are hashable. The assignment becomes pretty straightforward:
counts = collections.Counter(coocPairs)
ind = np.array(list(counts.keys())).T
a[ind[0], ind[1]] = list(counts.values())
As a rule, np.unique
with return_counts=True
is a standin for a Counter
. In this case, it is necessary to specify the axis, and to keep in mind that it will be one of the slower solutions:
ind, count = np.unique(coocPairs, return_counts=True, axis=0)
a[ind.T[0], ind.T[1]] = count
Instead, you can transform your pairs into linear indices in the raveled matrix:
ind = np.ravel_multi_index(tuple(np.array(coocPairs).T), a.shape)
Now you can do
ind, count = np.unique(ind, return_counts=True)
a.ravel()[ind] = count
Alternatively, you can use np.bincount
to get the counts much faster, or np.add.at
to avoid the up-front counting. The bincount
solution with the raveled index saves you the trouble of preallocating a
:
ind = np.ravel_multi_index(tuple(np.array(coocPairs).T), (n, n))
a = np.bincount(ind, minlength=n * n).reahape(n, n)
Upvotes: 2
Reputation: 53029
You could use np.add.at
np.add.at(a,tuple(coocPairs.T),1)
If that's not fast enough there are faster but less straight-forward np.bincount
based solutions. Those rely on flattened indexing using np.ravel_multi_index
.
Upvotes: 1