Reputation: 9858
Given a numpy array M of shape (r, c), e.g.

M = np.array([[ 1,  2,  3],
              [ 4,  5,  6],
              [ 7,  8,  9],
              [10, 11, 12],
              [13, 14, 15]])  # r = 5; c = 3

and a one-dimensional array a of length r, containing integers that can vary from 0 to k-1, e.g.

a = np.array([0, 0, 2, 1, 0])  # k = 4
I want to use the values in a to select rows from M, to get an intermediate result like this:

array([
    [[1, 2, 3], [4, 5, 6], [13, 14, 15]],  # rows of M where a == 0
    [[10, 11, 12]],                        # rows of M where a == 1
    [[7, 8, 9]],                           # rows of M where a == 2
    []                                     # rows of M where a == 3 (there are none)
])
(I don't need this intermediate array, but am just showing it for illustration.) The returned result would be a (k, c) array with the column-wise means from this array:
array([[ 6., 7., 8.], # means where a == 0
[10., 11., 12.], # means where a == 1
[ 7., 8., 9.], # etc.
[nan, nan, nan]])
I can do this with

np.array([M[a == i].mean(axis=0) for i in range(k)])

but is there a way (hopefully faster for large r and k) that purely uses numpy methods, instead of using a for loop to build a list (which then has to be converted back to an array)?
Upvotes: 4
Views: 3236
Reputation: 221574
Approach #1
NumPy solution leveraging BLAS-powered matrix multiplication -
In [46]: m = a == np.arange(k)[:,None] #or np.equal.outer(np.arange(k), a)
In [47]: m.dot(M)/m.sum(1,keepdims=True)
Out[47]:
array([[ 6., 7., 8.],
[10., 11., 12.],
[ 7., 8., 9.],
[nan, nan, nan]])
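For reference, here is the same idea wrapped as a self-contained function (a minimal sketch; the function name is mine, and M, a, k are as in the question):

import numpy as np

def group_means_matmul(M, a, k):
    # Boolean mask of shape (k, r): row i is True where a == i
    m = a == np.arange(k)[:, None]
    # m.dot(M) gives the per-group row sums; dividing by the per-group
    # counts turns them into means (empty groups divide 0/0 and yield nan)
    return m.dot(M) / m.sum(1, keepdims=True)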
For a performance boost on larger arrays, convert the mask to float and use np.bincount for counting the mask elements -
In [108]: m.astype(float).dot(M)/np.bincount(a,minlength=k)[:,None]
Out[108]:
array([[ 6., 7., 8.],
[10., 11., 12.],
[ 7., 8., 9.],
[nan, nan, nan]])
Approach #2
Another with np.bincount applied across columns -

n = M.shape[1]
out = np.empty((k, n))
for i in range(n):
    out[:, i] = np.bincount(a, M[:, i], minlength=k)
out /= np.bincount(a, minlength=k)[:, None]
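The weights argument of np.bincount is doing the heavy lifting here; a minimal sketch on one column of the question's example may make it clearer:

import numpy as np

a = np.array([0, 0, 2, 1, 0])
col = np.array([1, 4, 7, 10, 13])   # first column of M
# With weights, bincount sums col over the positions where a == i
np.bincount(a, col, minlength=4)    # -> array([18., 10.,  7.,  0.])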
Benchmarking all approaches listed in @Ehsan's post, using the benchit package (a few benchmarking tools packaged together; disclaimer: I am its author) to benchmark the proposed solutions:
import benchit
funcs = [m1, m2, m3, m4, m5, m6]
in_ = {(nr, nc, k): [np.random.randint(100, size=(nr, nc)), np.random.randint(0, k, size=nr), k]
       for nr in [100, 1000, 10000, 100000] for nc in [10, 100, 200] for k in [10, 100, 1000]}
t = benchit.timings(funcs, in_, multivar=True, input_name=['nrows', 'ncols', 'k'])
t.plot(logx=True, save='timings.png')
Since m6 is the original one, let's get the speedups with respect to it:
t.speedups(m6).plot(logx=True, logy=True, save='speedups.png')
There's no clear winner across all datasets. Nevertheless, m5 seems to be good with small ncols, while m1 seems better on bigger ncols.
Upvotes: 5
Reputation: 12407
A pandas solution (thanks to @Quang for suggesting a better version):

import pandas as pd
pd.DataFrame(M).groupby(a).mean().reindex(np.arange(k)).values
#[[ 6. 7. 8.]
# [10. 11. 12.]
# [ 7. 8. 9.]
# [nan nan nan]]
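A quick end-to-end check with the arrays from the question (a sketch; the reindex call is what inserts the nan row for the unused label 3):

import numpy as np
import pandas as pd

M = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]])
a = np.array([0, 0, 2, 1, 0])
k = 4
# groupby(a).mean() yields one row per label present in a;
# reindex(np.arange(k)) adds nan rows for the missing labels
pd.DataFrame(M).groupby(a).mean().reindex(np.arange(k)).values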
Comparison using @Divakar's benchit:
import numpy as np
import pandas as pd

#@Ehsan's solution
def m1(M, a, k):
    return pd.DataFrame(M).groupby(a).mean().reindex(np.arange(k)).values

#@Divakar's solution 1
def m2(M, a, k):
    m = a == np.arange(k)[:, None]  # or np.equal.outer(np.arange(k), a)
    return m.dot(M) / m.sum(1, keepdims=True)

#@Mad Physicist's solution 1
def m3(M, a, k):
    index = np.argsort(a)
    a = a[index]
    M = M[index]
    split = np.flatnonzero(np.diff(np.r_[-1, a]))
    means = np.add.reduceat(M, split, axis=0) / np.diff(np.r_[split, a.size])[:, None]
    result = np.full((k, M.shape[1]), np.nan)
    result[a[split], :] = means
    return result

#@Mad Physicist's solution 2
def m4(M, a, k):
    u, ind, cnt = np.unique(a, return_inverse=True, return_counts=True)
    ind = np.argsort(ind)
    split = np.cumsum(np.r_[0, cnt[:-1]])
    result = np.full((k, M.shape[1]), np.nan)
    result[u, :] = np.add.reduceat(M[ind], split, axis=0) / cnt[:, None]
    return result

#@Divakar's solution 2
def m5(M, a, k):
    n = M.shape[1]
    out = np.empty((k, n))
    for i in range(n):
        out[:, i] = np.bincount(a, M[:, i], minlength=k)
    out /= np.bincount(a, minlength=k)[:, None]
    return out

#@OP's baseline
def m6(M, a, k):
    return np.array([M[a == i].mean(axis=0) for i in range(k)])

funcs = [m1, m2, m3, m4, m5, m6]
in_ = {n: [np.random.randint(100, size=(n, 10)), np.random.randint(0, k, size=n), k]
       for n in [100, 1000, 10000, 100000]}
I think that depending on the number of clusters, columns, and rows, different approaches can be more efficient.
Benchmark plots (images omitted) for k=10, k=100, and k=1000, plus one for k=100 with

in_ = {n: [np.random.randint(100, size=(n, 1000)), np.random.randint(0, k, size=n), k] for n in [100, 1000, 10000]}

(M with 1000 columns and a smaller range of rows).
Upvotes: 5
Reputation: 114330
A pure numpy solution would require finding the sort order that puts the rows of M into groups. For example:
index = np.argsort(a)
You could then find the split points:
split = np.flatnonzero(np.diff(np.r_[-1, a[index]]))
And then apply np.add.reduceat to the result to get the sums:
sums = np.add.reduceat(M[index], split, axis=0)
Another diff gives the divisors:
lengths = np.diff(np.r_[split, a.size])
You can fill in the result using the indices encoded in a:
result = np.full((k, M.shape[1]), np.nan)
result[a[index][split], :] = sums / lengths[:, None]
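Traced on the arrays from the question, the intermediate values work out as follows (a sketch; the commented values follow from the definitions above):

import numpy as np

M = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]])
a = np.array([0, 0, 2, 1, 0])
k = 4

index = np.argsort(a)                                 # [0, 1, 4, 3, 2]
split = np.flatnonzero(np.diff(np.r_[-1, a[index]]))  # [0, 3, 4], group starts
sums = np.add.reduceat(M[index], split, axis=0)       # [[18 21 24] [10 11 12] [7 8 9]]
lengths = np.diff(np.r_[split, a.size])               # [3, 1, 1]
result = np.full((k, M.shape[1]), np.nan)
result[a[index][split], :] = sums / lengths[:, None]  # row 3 stays all-nan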
All in all:
def m3(M, a, k):
    index = np.argsort(a)
    a = a[index]
    M = M[index]
    split = np.flatnonzero(np.diff(np.r_[-1, a]))
    means = np.add.reduceat(M, split, axis=0) / np.diff(np.r_[split, a.size])[:, None]
    result = np.full((k, M.shape[1]), np.nan)
    result[a[split], :] = means
    return result
The solution can be made more concise using np.unique, which pre-computes the split points and indices for you. You will have to argsort the inverse indices to recover the grouping order, and apply a cumulative sum to the counts to get the correct split points:
def m4(M, a, k):
    u, ind, cnt = np.unique(a, return_inverse=True, return_counts=True)
    ind = np.argsort(ind)
    split = np.cumsum(np.r_[0, cnt[:-1]])
    result = np.full((k, M.shape[1]), np.nan)
    result[u, :] = np.add.reduceat(M[ind], split, axis=0) / cnt[:, None]
    return result
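On the same inputs, the np.unique bookkeeping works out like this (a sketch; I use a separate name, order, instead of reassigning ind):

import numpy as np

a = np.array([0, 0, 2, 1, 0])
u, ind, cnt = np.unique(a, return_inverse=True, return_counts=True)
# u   = [0, 1, 2]        -- labels that occur (3 is absent, hence the nan row)
# ind = [0, 0, 2, 1, 0]  -- inverse mapping; u[ind] reconstructs a
# cnt = [3, 1, 1]        -- group sizes
order = np.argsort(ind)                # [0, 1, 4, 3, 2], the grouping order
split = np.cumsum(np.r_[0, cnt[:-1]])  # [0, 3, 4], same split points as before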
Upvotes: 5