AudyCed
AudyCed

Reputation: 23

Numpy: Fastest Way to Fill Co-Occurrence Matrix

I have a long list of index tuples (lots of duplicates), and a matrix of n by n indices. Each tuple represents a co-occurrence.

For instance:

a = np.zeros(shape=(indexCount,indexCount))

I have tried this:

for i1,i2 in coocPairs:  #for instance (2374, 22003)
   a[i1][i2}+=1  #takes way too long

or:

np.put(a,coocPairs,1) #which obviously does not increment

or also:

np.add(a,coocPairs,1) #which takes even longer.

In an ideal world, there would be a function that takes my list of tuples, and builds a cooccurrence matrix with it, but alas (doc. isnt very helpful). I think the solution could be more on the python side of the code, but I run out of ideas. Any help welcome. Thank you for your time,

Upvotes: 2

Views: 1387

Answers (2)

Mad Physicist
Mad Physicist

Reputation: 114230

You can use a collections.Counter to get the values that will actually appear in the matrix. This works because tuples are hashable. The assignment becomes pretty straightforward:

counts = collections.Counter(coocPairs)
ind = np.array(list(counts.keys())).T
a[ind[0], ind[1]] = list(counts.values())

As a rule, np.unique with return_counts=True is a standin for a Counter. In this case, it is necessary to specify the axis, and to keep in mind that it will be one of the slower solutions:

ind, count = np.unique(coocPairs, return_counts=True, axis=0)
a[ind.T[0], ind.T[1]] = count

Instead, you can transform your pairs into linear indices in the raveled matrix:

ind = np.ravel_multi_index(tuple(np.array(coocPairs).T), a.shape)

Now you can do

ind, count = np.unique(ind, return_counts=True)
a.ravel()[ind] = count

Alternatively, you can use np.bincount to get the counts much faster, or np.add.at to avoid the up-front counting. The bincount solution with the raveled index saves you the trouble of preallocating a:

ind = np.ravel_multi_index(tuple(np.array(coocPairs).T), (n, n))
a = np.bincount(ind, minlength=n * n).reahape(n, n)

Upvotes: 2

Paul Panzer
Paul Panzer

Reputation: 53029

You could use np.add.at

np.add.at(a,tuple(coocPairs.T),1)

If that's not fast enough there are faster but less straight-forward np.bincount based solutions. Those rely on flattened indexing using np.ravel_multi_index.

Upvotes: 1

Related Questions