physicalattraction

Reputation: 6858

Speeding up summing certain columns in a matrix

Question in short

Given a large sparse csr_matrix A and a numpy array B, what is the fastest way to construct a numpy matrix C, such that C[i,j] = sum(A[k,j]) for all k where B[k] == i?
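To pin down the semantics: here is the definition written out as a naive (slow) loop on a toy A and B. This is just a reference implementation of what C should contain, not the fast construction I'm after:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Reference semantics of C[i, j] = sum of A[k, j] over all k with B[k] == i,
# spelled out as an explicit loop. Names A, B, C follow the question.
A = csr_matrix(np.array([[1, 0], [0, 2], [3, 0], [0, 4]]))
B = np.array([0, 1, 0, 1])  # row k of A belongs to group B[k]

n_groups = B.max() + 1
C = np.zeros((n_groups, A.shape[1]))
for k in range(A.shape[0]):
    C[B[k], :] += A[k, :].toarray().ravel()

print(C)
# group 0 collects rows 0 and 2, group 1 collects rows 1 and 3
```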

Details of question

I found a solution to do this, but I am not really content with how long it takes. I will first explain the problem, then my solution, then show my code, and then show my timings.

Problem I am working on a clustering algorithm in Python, and I'd like to speed it up. I have a sparse csr_matrix pam, which stores per person, per article, how many items of that article they bought. Furthermore, I have a numpy array clustering, which denotes for each person the cluster they belong to. Example:

      pam                pam.T              clustering
    article              person
p [[1 0 0 0]      a 
e  [0 2 0 0]      r [[1 0 1 0 0 0]         [0 0 0 0 1 1]
r  [1 1 0 0]      t  [0 2 1 0 0 0]
s  [0 0 1 0]      i  [0 0 0 1 0 1]
o  [0 0 0 1]      c  [0 0 0 0 1 2]]
n  [0 0 1 2]]     l
                  e

What I'd like to calculate is acm: the number of items that all people in one cluster together bought. This amounts to, for every column i of acm, adding those columns p of pam.T for which clustering[p] == i.

    acm
  cluster
a 
r [[2 0]
t  [3 0]
i  [1 1]
c  [0 3]]
l
e

Solution First, I create another sparse matrix pcm, in which element [i,j] indicates whether person i is in cluster j. Result (when cast to a dense matrix):

        pcm
      cluster
p [[ True False]
e  [ True False]
r  [ True False]
s  [ True False]
o  [False  True]
n  [False  True]]

Next, I matrix multiply pam.T with pcm to get the matrix that I want.
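On the toy example above, this pipeline reproduces acm exactly (a self-contained sketch of the example matrices):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy check of the pam.T @ pcm formulation on the example matrices above.
pam = csr_matrix(np.array([[1, 0, 0, 0],
                           [0, 2, 0, 0],
                           [1, 1, 0, 0],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1],
                           [0, 0, 1, 2]]))
clustering = np.array([0, 0, 0, 0, 1, 1])

# person-cluster indicator matrix
pcm = csr_matrix((np.ones(6, dtype=bool),
                  (np.arange(6), clustering)), shape=(6, 2))

acm = pam.T.dot(pcm).todense()
print(acm)
# [[2 0]
#  [3 0]
#  [1 1]
#  [0 3]]
```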

Code I wrote the following program to test the duration of this method in practice.

import numpy as np
from scipy.sparse import csr_matrix
from timeit import timeit

def _clustering2pcm(clustering):
    '''
    Converts a clustering (np array) into a person-cluster matrix (pcm)
    '''        
    N_persons = clustering.size
    m_person = np.arange(N_persons)
    clusters = np.unique(clustering)
    N_clusters = clusters.size
    m_data = [True] * N_persons

    pcm = csr_matrix( (m_data, (m_person, clustering)), shape = (N_persons, N_clusters))

    return pcm

def pam_clustering2acm():
    '''
    Convert a person-article matrix and a given clustering into an 
    article-cluster matrix
    '''
    global clustering
    global pam

    pcm = _clustering2pcm(clustering)    
    acm = csr_matrix.transpose(pam).dot(pcm).todense()
    return acm

if __name__ == '__main__':
    global clustering
    global pam

    N_persons = 200000
    N_articles = 400
    N_shoppings = 400000
    N_clusters = 20

    m_person = np.random.choice(np.arange(N_persons), size = N_shoppings, replace = True)
    m_article = np.random.choice(np.arange(N_articles), size = N_shoppings, replace = True)
    m_data = np.random.choice([1, 2], p = [0.99, 0.01], size = N_shoppings, replace = True)
    pam = csr_matrix( (m_data, (m_person, m_article)), shape = (N_persons, N_articles))

    clustering = np.random.choice(np.arange(N_clusters), size = N_persons, replace = True)
    print(timeit(pam_clustering2acm, number = 100))

Timing It turns out that for these 100 runs, I need 5.1 seconds. 3.6 seconds of these are spent on creating pcm. I have the feeling there could be a faster way to calculate this matrix without creating a temporary sparse matrix, but I don't see one without looping. Is there a faster way to construct it?

EDIT

After Martino's answer, I have tried to implement the loop-over-clusters and slicing algorithm, but that is even slower. It now takes 12.5 seconds to calculate acm 100 times, of which 4.1 seconds remain if I remove the line acm[:,i] = pam[p,:].sum(axis = 0).

def pam_clustering2acm_loopoverclusters():
    global clustering
    global pam

    N_articles = pam.shape[1]
    clusters = np.unique(clustering)
    N_clusters = clusters.size
    acm = np.zeros([N_articles, N_clusters])
    for i in clusters:
        p = np.where(clustering == i)[0]
        acm[:,i] = pam[p,:].sum(axis = 0)

    return acm
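For comparison, one loop-free formulation that skips the temporary pcm entirely would be to scatter-add the raw CSR arrays with np.add.at. This is only a sketch on the toy example and untimed; np.add.at is unbuffered and can itself be slow, so it would need benchmarking:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Sketch: build acm without the temporary pcm matrix, by scatter-adding
# the CSR data arrays directly. Untimed idea, not a tested drop-in.
pam = csr_matrix(np.array([[1, 0, 0, 0],
                           [0, 2, 0, 0],
                           [1, 1, 0, 0],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1],
                           [0, 0, 1, 2]]))
clustering = np.array([0, 0, 0, 0, 1, 1])
N_clusters = 2

# Recover the row (person) index of every stored nonzero from indptr.
rows = np.repeat(np.arange(pam.shape[0]), np.diff(pam.indptr))

acm = np.zeros((pam.shape[1], N_clusters), dtype=pam.dtype)
np.add.at(acm, (pam.indices, clustering[rows]), pam.data)
print(acm)
# [[2 0]
#  [3 0]
#  [1 1]
#  [0 3]]
```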

Upvotes: 3

Views: 137

Answers (2)

Jaime

Reputation: 67427

This is about 50x faster than your _clustering2pcm function:

def pcm(clustering):
    n = clustering.size
    data = np.ones((n,), dtype=bool)
    indptr = np.arange(n+1)
    return csr_matrix((data, clustering, indptr))

I haven't looked at the source code, but when you pass the CSR constructor the (data, (rows, cols)) structure, it is almost certainly using that to create a COO matrix, then converting it to CSR. Because your matrix is so simple, it is very easy to put the actual CSR matrix description arrays together as above, and skip all of that.
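To see why those three arrays are all you need: each row of pcm holds exactly one True, its column index is the cluster id, and indptr simply counts up by one per row. A small sketch on a toy clustering:

```python
import numpy as np
from scipy.sparse import csr_matrix

# For a toy clustering, the direct CSR description is trivially simple:
# data = one True per row, indices = the cluster ids, indptr = 0..n.
clustering = np.array([0, 0, 1, 2, 1])
n = clustering.size
pcm = csr_matrix((np.ones(n, dtype=bool), clustering, np.arange(n + 1)))
print(pcm.toarray().astype(int))
# [[1 0 0]
#  [1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [0 1 0]]
```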

This cuts your execution time down by almost a factor of three:

In [38]: %timeit pam_clustering2acm()
10 loops, best of 3: 36.9 ms per loop

In [40]: %timeit pam.T.dot(pcm(clustering)).A
100 loops, best of 3: 12.8 ms per loop

In [42]: np.all(pam.T.dot(pcm(clustering)).A == pam_clustering2acm())
Out[42]: True

Upvotes: 4

Martino

Reputation: 11

I refer you to the scipy.sparse docs (http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix), where they say that row slicing is efficient (as opposed to column slicing), so it is probably better to stick to the non-transposed matrix. Then if you browse down, there is a sum method where the axis can be specified. It is probably better to use the methods that come with your object, as they are likely to use compiled code. This comes at the cost of looping over the clusters (of which I am assuming there are not too many).

Upvotes: 1
