Reputation: 9858
Given a numpy array M of shape (r, c), e.g.

M = np.array([[ 1,  2,  3],
              [ 4,  5,  6],
              [ 7,  8,  9],
              [10, 11, 12],
              [13, 14, 15]])  # r = 5; c = 3

and a one-dimensional array a of length r, containing integers that can vary from 0 to k-1, e.g.

a = np.array([0, 0, 2, 1, 0])  # k = 4
I want to use the values in a to select rows from M, to get an intermediate result like this:

array([
    [[1, 2, 3], [4, 5, 6], [13, 14, 15]],  # rows of M where a == 0
    [[10, 11, 12]],                        # rows of M where a == 1
    [[7, 8, 9]],                           # rows of M where a == 2
    []                                     # rows of M where a == 3 (there are none)
])
(I don't need this intermediate array, but am just showing it for illustration.) The returned result would be a (k, c) array with the column-wise means from this array:
array([[ 6., 7., 8.], # means where a == 0
[10., 11., 12.], # means where a == 1
[ 7., 8., 9.], # etc.
[nan, nan, nan]])
I can do this with

np.array([M[a == i].mean(axis=0) for i in range(k)])

but is there a way (hopefully faster for large r and k) that purely uses numpy methods, instead of using a for loop to build a list (which then has to be converted back to an array)?
Upvotes: 4
Views: 3236
Reputation: 221574
Approach #1
NumPy solution leveraging BLAS-powered matrix multiplication -
In [46]: m = a == np.arange(k)[:,None] #or np.equal.outer(np.arange(k), a)
In [47]: m.dot(M)/m.sum(1,keepdims=True)
Out[47]:
array([[ 6., 7., 8.],
[10., 11., 12.],
[ 7., 8., 9.],
[nan, nan, nan]])
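For reference, here is the same idea wrapped as a self-contained function (a minimal sketch; the function name is mine, and M, a, k are as in the question):

import numpy as np

def group_means_matmul(M, a, k):
    # Boolean mask of shape (k, r): row i is True where a == i
    m = a == np.arange(k)[:, None]
    # m.dot(M) gives the per-group row sums; dividing by the per-group
    # counts turns them into means (empty groups divide 0/0 and yield nan)
    return m.dot(M) / m.sum(1, keepdims=True)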
For a performance boost on larger arrays, convert the mask to float and use np.bincount for counting the mask elements -
In [108]: m.astype(float).dot(M)/np.bincount(a,minlength=k)[:,None]
Out[108]:
array([[ 6., 7., 8.],
[10., 11., 12.],
[ 7., 8., 9.],
[nan, nan, nan]])
Approach #2
Another with np.bincount applied across columns -

n = M.shape[1]
out = np.empty((k, n))
for i in range(n):
    out[:, i] = np.bincount(a, M[:, i], minlength=k)
out /= np.bincount(a, minlength=k)[:, None]
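The weights argument of np.bincount is doing the heavy lifting here; a minimal sketch on one column of the question's example may make it clearer:

import numpy as np

a = np.array([0, 0, 2, 1, 0])
col = np.array([1, 4, 7, 10, 13])   # first column of M
# With weights, bincount sums col over the positions where a == i
np.bincount(a, col, minlength=4)    # -> array([18., 10.,  7.,  0.])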
Benchmarking all approaches listed in @Ehsan's post, using the benchit package (a few benchmarking tools packaged together; disclaimer: I am its author) to benchmark the proposed solutions:
import benchit
funcs = [m1, m2, m3, m4, m5, m6]
in_ = {(nr, nc, k): [np.random.randint(100, size=(nr, nc)), np.random.randint(0, k, size=nr), k]
       for nr in [100, 1000, 10000, 100000] for nc in [10, 100, 200] for k in [10, 100, 1000]}
t = benchit.timings(funcs, in_, multivar=True, input_name=['nrows', 'ncols', 'k'])
t.plot(logx=True, save='timings.png')
Since m6 is the original one, let's get the speedups with respect to it:
t.speedups(m6).plot(logx=True, logy=True, save='speedups.png')
There's no clear winner across all datasets. Nevertheless, m5 seems to be good with small ncols, while m1 seems better on bigger ncols.
Upvotes: 5
Reputation: 12407
A pandas solution (thanks to @Quang for suggesting a better version):

import pandas as pd
pd.DataFrame(M).groupby(a).mean().reindex(np.arange(k)).values
#[[ 6. 7. 8.]
# [10. 11. 12.]
# [ 7. 8. 9.]
# [nan nan nan]]
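A quick end-to-end check with the arrays from the question (a sketch; the reindex call is what inserts the nan row for the unused label 3):

import numpy as np
import pandas as pd

M = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]])
a = np.array([0, 0, 2, 1, 0])
k = 4
# groupby(a).mean() yields one row per label present in a;
# reindex(np.arange(k)) adds nan rows for the missing labels
pd.DataFrame(M).groupby(a).mean().reindex(np.arange(k)).values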
Comparison using @Divakar's benchit:
import numpy as np
import pandas as pd

#@Ehsan's solution
def m1(M, a, k):
    return pd.DataFrame(M).groupby(a).mean().reindex(np.arange(k)).values

#@Divakar's solution 1
def m2(M, a, k):
    m = a == np.arange(k)[:, None]  # or np.equal.outer(np.arange(k), a)
    return m.dot(M) / m.sum(1, keepdims=True)

#@Mad Physicist's solution 1
def m3(M, a, k):
    index = np.argsort(a)
    a = a[index]
    M = M[index]
    split = np.flatnonzero(np.diff(np.r_[-1, a]))
    means = np.add.reduceat(M, split, axis=0) / np.diff(np.r_[split, a.size])[:, None]
    result = np.full((k, M.shape[1]), np.nan)
    result[a[split], :] = means
    return result

#@Mad Physicist's solution 2
def m4(M, a, k):
    u, ind, cnt = np.unique(a, return_inverse=True, return_counts=True)
    ind = np.argsort(ind)
    split = np.cumsum(np.r_[0, cnt[:-1]])
    result = np.full((k, M.shape[1]), np.nan)
    result[u, :] = np.add.reduceat(M[ind], split, axis=0) / cnt[:, None]
    return result

#@Divakar's solution 2
def m5(M, a, k):
    n = M.shape[1]
    out = np.empty((k, n))
    for i in range(n):
        out[:, i] = np.bincount(a, M[:, i], minlength=k)
    out /= np.bincount(a, minlength=k)[:, None]
    return out

#@OP's baseline
def m6(M, a, k):
    return np.array([M[a == i].mean(axis=0) for i in range(k)])

funcs = [m1, m2, m3, m4, m5, m6]
in_ = {n: [np.random.randint(100, size=(n, 10)), np.random.randint(0, k, size=n), k]
       for n in [100, 1000, 10000, 100000]}
I think that depending on the number of clusters, columns, and rows, different approaches can be more efficient.
Benchmark plots (images omitted) for k=10, k=100, and k=1000, plus one for k=100 with

in_ = {n: [np.random.randint(100, size=(n, 1000)), np.random.randint(0, k, size=n), k] for n in [100, 1000, 10000]}

(M with 1000 columns and a smaller range of rows).
Upvotes: 5
Reputation: 114330
A pure numpy solution would require finding the sort order that puts the rows of M into groups. For example:
index = np.argsort(a)
You could then find the split points:
split = np.flatnonzero(np.diff(np.r_[-1, a[index]]))
And then apply np.add.reduceat to the result to get the sums:
sums = np.add.reduceat(M[index], split, axis=0)
Another diff gives the divisors:
lengths = np.diff(np.r_[split, a.size])
You can fill in the result using the indices encoded in a:
result = np.full((k, M.shape[1]), np.nan)
result[a[index][split], :] = sums / lengths[:, None]
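Traced on the arrays from the question, the intermediate values work out as follows (a sketch; the commented values follow from the definitions above):

import numpy as np

M = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]])
a = np.array([0, 0, 2, 1, 0])
k = 4

index = np.argsort(a)                                 # [0, 1, 4, 3, 2]
split = np.flatnonzero(np.diff(np.r_[-1, a[index]]))  # [0, 3, 4], group starts
sums = np.add.reduceat(M[index], split, axis=0)       # [[18 21 24] [10 11 12] [7 8 9]]
lengths = np.diff(np.r_[split, a.size])               # [3, 1, 1]
result = np.full((k, M.shape[1]), np.nan)
result[a[index][split], :] = sums / lengths[:, None]  # row 3 stays all-nan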
All in all:
def m3(M, a, k):
    index = np.argsort(a)
    a = a[index]
    M = M[index]
    split = np.flatnonzero(np.diff(np.r_[-1, a]))
    means = np.add.reduceat(M, split, axis=0) / np.diff(np.r_[split, a.size])[:, None]
    result = np.full((k, M.shape[1]), np.nan)
    result[a[split], :] = means
    return result
The solution can be made more concise using np.unique, which pre-computes the split points and indices for you. You will have to argsort the inverse indices to recover the grouping order, and apply a cumulative sum to the counts to get the correct split points:
def m4(M, a, k):
    u, ind, cnt = np.unique(a, return_inverse=True, return_counts=True)
    ind = np.argsort(ind)
    split = np.cumsum(np.r_[0, cnt[:-1]])
    result = np.full((k, M.shape[1]), np.nan)
    result[u, :] = np.add.reduceat(M[ind], split, axis=0) / cnt[:, None]
    return result
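On the same inputs, the np.unique bookkeeping works out like this (a sketch; I use a separate name, order, instead of reassigning ind):

import numpy as np

a = np.array([0, 0, 2, 1, 0])
u, ind, cnt = np.unique(a, return_inverse=True, return_counts=True)
# u   = [0, 1, 2]        -- labels that occur (3 is absent, hence the nan row)
# ind = [0, 0, 2, 1, 0]  -- inverse mapping; u[ind] reconstructs a
# cnt = [3, 1, 1]        -- group sizes
order = np.argsort(ind)                # [0, 1, 4, 3, 2], the grouping order
split = np.cumsum(np.r_[0, cnt[:-1]])  # [0, 3, 4], same split points as before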
Upvotes: 5