Reputation: 6858
Question in short
Given a large sparse csr_matrix A
and a numpy array B
, what is the fastest way to construct a numpy matrix C
, such that C[i,j] = sum(A[k,j])
for all k
where B[k] == i
?
Details of question
I found a solution to do this, but I am not really content with how long it takes. I will first explain the problem, then my solution, then show my code, and then show my timings.
Problem
I am working on a clustering algorithm in Python, and I'd like to speed it up. I have a sparse csr_matrix pam
, in which I have per person per article how many items they bought of that article. Furthermore, I have a numpy array clustering
, in which the cluster that person belongs to is denoted. Example:
pam pam.T clustering
article person
p [[1 0 0 0] a
e [0 2 0 0] r [[1 0 1 0 0 0] [0 0 0 0 1 1]
r [1 1 0 0] t [0 2 1 0 0 0]
s [0 0 1 0] i [0 0 0 1 0 1]
o [0 0 0 1] c [0 0 0 0 1 2]]
n [0 0 1 2]] l
e
What I like to calculate is acm
: the amount of items all people in one cluster together bought. This amounts to, for every column i
in acm
, adding those columns p
of pam.T
for which clustering[p] == i
.
acm
cluster
a
r [[2 0]
t [3 0]
i [1 1]
c [0 3]]
l
e
Solution
First, I create another sparse matrix pcm
, in which I indicate per element [i,j]
if person i
is in cluster j
. Result (when cast to dense matrix):
pcm
cluster
p [[False True]
e [False True]
r [ True False]
s [False True]
o [False True]
n [ True False]]
Next, I matrix multiply pam.T
with pcm
to get the matrix that I want.
Code I wrote the following program to test the duration of this method in practice.
import numpy as np
from scipy.sparse.csr import csr_matrix
from timeit import timeit
def _clustering2pcm(clustering):
'''
Converts a clustering (np array) into a person-cluster matrix (pcm)
'''
N_persons = clustering.size
m_person = np.arange(N_persons)
clusters = np.unique(clustering)
N_clusters = clusters.size
m_data = [True] * N_persons
pcm = csr_matrix( (m_data, (m_person, clustering)), shape = (N_persons, N_clusters))
return pcm
def pam_clustering2acm():
'''
Convert a person-article matrix and a given clustering into an
article-cluster matrix
'''
global clustering
global pam
pcm = _clustering2pcm(clustering)
acm = csr_matrix.transpose(pam).dot(pcm).todense()
return acm
if __name__ == '__main__':
global clustering
global pam
N_persons = 200000
N_articles = 400
N_shoppings = 400000
N_clusters = 20
m_person = np.random.choice(np.arange(N_persons), size = N_shoppings, replace = True)
m_article = np.random.choice(np.arange(N_articles), size = N_shoppings, replace = True)
m_data = np.random.choice([1, 2], p = [0.99, 0.01], size = N_shoppings, replace = True)
pam = csr_matrix( (m_data, (m_person, m_article)), shape = (N_persons, N_articles))
clustering = np.random.choice(np.arange(N_clusters), size = N_persons, replace = True)
print timeit(pam_clustering2acm, number = 100)
Timing
It turns out that for these 100 runs, I need 5.1 seconds. 3.6 seconds of these are spent on creating pcm
. I have the feeling there could be a faster way to calculate this matrix without creating a temporary sparse matrix, but I don't see one without looping. Is there a faster way of construction?
EDIT
After Martino's answer, I have tried to implement the loop over clusters and slicing algorithm, but that is even slower. It takes now 12.5 seconds to calculate acm
100 times, of which 4.1 seconds remain if I remove the line acm[:,i] = pam[p,:].sum(axis = 0)
.
def pam_clustering2acm_loopoverclusters():
global clustering
global pam
N_articles = pam.shape[1]
clusters = np.unique(clustering)
N_clusters = clusters.size
acm = np.zeros([N_articles, N_clusters])
for i in clusters:
p = np.where(clustering == i)[0]
acm[:,i] = pam[p,:].sum(axis = 0)
return acm
Upvotes: 3
Views: 137
Reputation: 67427
This is about 50x faster than your _clustering2pcm
function:
def pcm(clustering):
n = clustering.size
data = np.ones((n,), dtype=bool)
indptr = np.arange(n+1)
return csr_matrix((data, clustering, indptr))
I haven't looked at the source code, but when you pass the CSR constructor the (data, (rows, cols))
structure, it is almost certainly using that to create a COO matrix, then converting it to CSR. Because your matrix is so simple, it is very easy to put the actual CSR matrix description arrays together as above, and skip all of that.
This almost cuts your execution time down by three:
In [38]: %timeit pam_clustering2acm()
10 loops, best of 3: 36.9 ms per loop
In [40]: %timeit pam.T.dot(pcm(clustering)).A
100 loops, best of 3: 12.8 ms per loop
In [42]: np.all(pam.T.dot(pcm(clustering)).A == pam_clustering2acm())
Out[42]: True
Upvotes: 4
Reputation: 11
I refer you to the scipy.sparse docs (http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix). Where they say the row slicing is efficient (as opposed to column splicing), so it is probably better to to stick to the non-transposed matrix. Then if you browse down there is a sum function where the axis can be specified. It is probably better to use the methods that come with your object as they are likely to use compiled code. This is at the cost of looping through clusters (of which I am assuming there are not too many).
Upvotes: 1