Reputation: 103
I am trying to figure out how to speed up the following Python code.
Basically, the code builds the matrix of outter products of a matrix C
and stores it as block diagonal sparse matrix.
I use numpy.repeat()
to build indices into the block diagonal.
Profiling the code revealed that calls to numpy.repeat()
take about 50 % of the execution time.
import numpy as np
import scipy.sparse as spspar
L = 1000
K = 100
C = np.random.randn(L,K)
# From the matrix of outter products of C and store in block_diagonal
# sparse matrix
CCt = np.einsum('ij...,i...->ij...',C,C)
# create indices into the block diagonal sparse coo matrix
i = np.tile(np.tile(np.arange(K),K),L) + K*np.repeat(np.arange(L),K*K)
j = np.tile(np.repeat(np.arange(K),K),L) + K*np.repeat(np.arange(L),K*K)
# store as block diagonal sparse coo matrix
BlckCCt = spspar.coo_matrix((CCt.flatten(),(j,i)),shape=(K*K*L,K*K*L))
Initially, I was building the sparse matrix as follows
BlckCCt = spspar.block_diag(CCt,"coo")
This is waay too slow and memory intensive.
Thanks for any input.
Edit: I compared @hjpaul's suggestions using ipython timeit. Here's what I can report
timeit K*np.repeat(np.arange(L),K*K)
10 loops, best of 3: 82.1 ms per loop
timeit (np.zeros((K*K,),int)+np.arange(L)[:,None]).flatten()*K
10 loops, best of 3: 89.9 ms per loop
timeit np.tile(np.arange(L)*K,K*K).reshape(K*K,L).T.flatten()
10 loops, best of 3: 85.5 ms per loop
So it looks like they all take about the same amount (I'm new to ipython profiling so perhaps I'm not comparing them the right way).
Upvotes: 1
Views: 1988
Reputation: 231510
Just for reference, your
CCt = np.einsum('ij...,i...->ij...',C,C)
is the same as
CCt1=C[:,None,:]*C[:,:,None]
producing a (L,K,K)
array. For my smaller test case np.einsum
is 2x faster.
sparse.block_diag
converts each submatrix to coo
, and passes them to sparse.bmat
. bmat
collects the rows
, cols
, data
of all the sub matrices into a big arrays similar to your j, i
, and calls coo_matrix
with those.
Doing ipython
timeit
on various pieces, I agree that K*np.repeat(np.arange(L),K*K)
is the slowest block of code. Much slower, for example, than the tile
piece.
Since you are doing the same repeat
for i
and j
, can't you do it just once, and use that variable twice?
kk= K*np.repeat(np.arange(L),K*K)
ii=np.tile(np.tile(np.arange(K),K),L) + kk
jj=np.tile(np.repeat(np.arange(K),K),L) + kk
I'll look at that piece some more, but that's a start.
Here's a slight improvement (20%) on the repeat
:
(np.zeros((K*K,),int)+np.arange(L)[:,None]).flatten()*K
even better (>2x)
np.tile(np.arange(L)*K,K*K).reshape(L,K*K).T.flatten()
I moved *K
to the smaller arange(L)
, and use faster tile
. .T.flatten
takes care of changing the order.
As per comment, the reshape should be (K*K,L)
. I was testing with values where that didn't matter. And relative speeds of these alternatives vary with the relative sizes of K
and L
.
The tiling for the first part of i
and j
is optional, if kk
(the 2nd part) is shape (L,K,K) (like CCt
). Whether it saves time is unclear. Striding is more complicated than with the fully tiled version (0,4,0)
v. (4,)
.)
i = (np.arange(K)[None,None,:] + kk.reshape(L,K,K)).flatten()
j = (np.arange(K)[None,:,None] + kk.reshape(L,K,K)).flatten()
we could do the same with kk
k1 = K*np.arange(L)[:,None,None]
np.arange(K)[None,None,:] + k1
is (L,1,K), so we need to tile it
i = np.tile( np.arange(K)[None,None,:] + k1, (1,K,1)).flatten()
j = np.tile( np.arange(K)[None,:,None] + k1, (1,1,K)).flatten()
Another way to generate these arrays would be to use np.ix_
to reshape the ranges, and then just sum values.
i = np.sum(np.ix_(K*np.arange(L), np.arange(K), np.zeros(K)))
j = np.sum(np.ix_(K*np.arange(L), np.zeros(K), np.arange(K)))
(add .flatten
as needed). I've tested this on small dimensions and it looks right. I don't know about speed.
Upvotes: 4