Reputation: 1263
Given a set of discrete locations (e.g. "sites") that are pairwise related in some categorical way (e.g. general proximity) and that each carry local-level data (e.g. population size), I wish to efficiently compute the mean correlation coefficient between the local-level data of all site pairs that share the same relationship category.
For example, I assumed 100 sites and randomized their pairwise relations using values 1 to 25, yielding the triangular matrix relations:
import numpy as np
sites = 100
categ = 25
relations = np.random.randint(low=1, high=categ+1, size=(sites, sites))
relations = np.triu(relations) # keep only the upper triangle, since relation_ij = relation_ji
np.fill_diagonal(relations, 0) # ignore self-relation
I also have 5000 replicates of simulation results on each site:
sims = 5000
res = np.round(np.random.rand(sites, sims),1)
To compute the mean pairwise correlation for each relation category, I first calculated, for each category i, the correlation coefficient rho[j] between the simulation results res of each unique site pair j, and then took the average across all pairs with relation i:
rho_list = np.ones(categ)*99
for i in range(1, categ+1):
    idr = np.transpose(np.where(relations == i)) # pairwise site indices of the same relation category
    comp = np.vstack([res[idr[:,0]].ravel(), res[idr[:,1]].ravel()]) # pairwise comparisons of simulation results from the same relation category
    comp_uniq = np.reshape(comp.T, (len(idr), res.shape[1], -1)) # reshape above into pairwise comparisons of simulation results between unique site pairs
    rho = np.ones(len(idr))*99 # correlation coefficients of all unique site pairs of current relation category
    for j in range(len(idr)): # loop through unique site pairs
        comp_uniq_s = comp_uniq[j][np.all(comp_uniq!=0, axis=2)[j]].T # shorten comparisons by removing pairs with zero-valued result
        rho[j] = np.corrcoef(comp_uniq_s[0], comp_uniq_s[1])[0,1]
    rho_list[i-1] = np.nanmean(rho)
Although this script works, once I increase sites = 400 the entire computation takes more than 6 hours to finish, which leads me to question my use of array functions. What is the reason for this poor performance, and how can I optimize the algorithm?
Upvotes: 4
Views: 979
Reputation: 221504
We can vectorize the innermost loop (the one with the j iterator) with some masking to take care of the ragged nature of the data processed at each iteration of that loop. We can also take out the slowish np.corrcoef (inspired by this post). Additionally, we can optimize a few steps at the start of the outer loop, especially the stacking steps, which could be the bottlenecks.
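As a side note, here is a minimal sketch (my illustration, not code from the linked post) of what replacing np.corrcoef amounts to for two plain 1-D arrays: mean-center each array, then combine dot products. The einsum calls in the code below apply the same formula row-wise, with the zero-mask folded into the means and sums of squares.
import numpy as np

a = np.random.rand(1000)
b = np.random.rand(1000)

# Pearson correlation from mean-centered dot products
a_m = a - a.mean()
b_m = b - b.mean()
rho_manual = a_m.dot(b_m) / np.sqrt(a_m.dot(a_m) * b_m.dot(b_m))

# Agrees with the off-diagonal entry of np.corrcoef
assert np.isclose(rho_manual, np.corrcoef(a, b)[0, 1])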
Thus, the complete code would reduce to something like this -
for i in range(1, categ+1):
    r, c = np.where(relations==i)                         # site indices of all pairs in category i
    A = res[r]                                            # sim results of the first site in each pair
    B = res[c]                                            # sim results of the second site in each pair
    mask0 = ~((A!=0) & (B!=0))                            # mark sim columns where either result is zero
    A[mask0] = 0
    B[mask0] = 0
    count = mask0.shape[-1] - mask0.sum(-1,keepdims=1)    # number of valid sims per pair
    A_mA = A - A.sum(-1, keepdims=1)/count                # mean-center using valid sims only
    B_mB = B - B.sum(-1, keepdims=1)/count
    A_mA[mask0] = 0
    B_mB[mask0] = 0
    ssA = np.einsum('ij,ij->i',A_mA, A_mA)                # sum of squares per pair
    ssB = np.einsum('ij,ij->i',B_mB, B_mB)
    rho = np.einsum('ij,ij->i',A_mA, B_mB)/np.sqrt(ssA*ssB)  # Pearson rho per pair
    rho_list[i-1] = np.nanmean(rho)
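The timings below call loopy_app() and vectorized_app(). Neither function is spelled out in the post; presumably they just wrap the question's double loop and the snippet above into zero-argument functions over the same relations and res arrays, roughly as in this sketch, which also allows a quick check that the two agree:
def loopy_app():
    # Question's original double loop, returning per-category mean correlations
    rho_list = np.ones(categ)*99
    for i in range(1, categ+1):
        idr = np.transpose(np.where(relations == i))
        comp = np.vstack([res[idr[:,0]].ravel(), res[idr[:,1]].ravel()])
        comp_uniq = np.reshape(comp.T, (len(idr), res.shape[1], -1))
        rho = np.ones(len(idr))*99
        for j in range(len(idr)):
            comp_uniq_s = comp_uniq[j][np.all(comp_uniq!=0, axis=2)[j]].T
            rho[j] = np.corrcoef(comp_uniq_s[0], comp_uniq_s[1])[0,1]
        rho_list[i-1] = np.nanmean(rho)
    return rho_list

def vectorized_app():
    # Vectorized inner loop from above, returning per-category mean correlations
    rho_list = np.ones(categ)*99
    for i in range(1, categ+1):
        r, c = np.where(relations==i)
        A = res[r]
        B = res[c]
        mask0 = ~((A!=0) & (B!=0))
        A[mask0] = 0
        B[mask0] = 0
        count = mask0.shape[-1] - mask0.sum(-1, keepdims=1)
        A_mA = A - A.sum(-1, keepdims=1)/count
        B_mB = B - B.sum(-1, keepdims=1)/count
        A_mA[mask0] = 0
        B_mB[mask0] = 0
        ssA = np.einsum('ij,ij->i', A_mA, A_mA)
        ssB = np.einsum('ij,ij->i', B_mB, B_mB)
        rho = np.einsum('ij,ij->i', A_mA, B_mB)/np.sqrt(ssA*ssB)
        rho_list[i-1] = np.nanmean(rho)
    return rho_list

# Sanity check: both versions should give the same per-category means
# (equal_nan=True tolerates NaNs from any degenerate pairs).
assert np.allclose(loopy_app(), vectorized_app(), equal_nan=True)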
Runtime tests
Case # 1: On given sample data, with sites = 100
In [381]: %timeit loopy_app()
1 loop, best of 3: 7.45 s per loop
In [382]: %timeit vectorized_app()
1 loop, best of 3: 479 ms per loop
15x+ speedup.
Case # 2: With sites = 200
In [387]: %timeit loopy_app()
1 loop, best of 3: 1min 56s per loop
In [388]: %timeit vectorized_app()
1 loop, best of 3: 1.86 s per loop
In [390]: 116/1.86
Out[390]: 62.36559139784946
62x+ speedup.
Case # 3: Finally with sites = 400
In [392]: %timeit vectorized_app()
1 loop, best of 3: 7.64 s per loop
This took 6hrs+ at OP's end with their loopy method.
From the timings, it became clear that vectorizing the inner loop was the key to getting noticeable speedups for large sites.
Upvotes: 6