Jesse Lopez

Reputation: 97

Mean over 2d numpy array with varying slices

I need to calculate the mean over the columns of a 2D numpy array where the slice per column varies.

For example, I have an array

    arr = np.arange(20).reshape(4, 5)

with the end index of the slice for each column mean defined as

    bot_ix = np.array([3, 2, 2, 1, 2])

The mean of the first column would then be

    arr[0:bot_ix[0], 0].mean()

What's the appropriate (i.e. Pythonic + efficient) way to do this? My array sizes are ~(50, 50K).
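For reference, the obvious loop version over the sample data above (one mean per column, each over a different slice length) would be:

```python
import numpy as np

arr = np.arange(20).reshape(4, 5)
bot_ix = np.array([3, 2, 2, 1, 2])

# Mean of arr[0:bot_ix[j], j] for each column j
col_means = [arr[:n, j].mean() for j, n in enumerate(bot_ix)]
print(np.array(col_means))
```

This is correct but loops in Python over 50K columns, which is what I'd like to avoid.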

Upvotes: 3

Views: 78

Answers (3)

Divakar

Reputation: 221594

You could use NumPy broadcasting -

    mask = bot_ix > np.arange(arr.shape[0])[:,None]
    out = np.true_divide(np.einsum('ij,ij->j',arr,mask),mask.sum(0))

Sample run to verify results -

    In [431]: arr
    Out[431]: 
    array([[ 0,  1,  2,  3,  4],
           [ 5,  6,  7,  8,  9],
           [10, 11, 12, 13, 14],
           [15, 16, 17, 18, 19]])

    In [432]: bot_ix
    Out[432]: array([3, 2, 2, 1, 2])

    In [433]: np.true_divide(np.einsum('ij,ij->j',arr,mask),mask.sum(0))
    Out[433]: array([ 5. ,  3.5,  4.5,  3. ,  6.5])

    In [434]: [arr[0:item, i].mean() for i,item in enumerate(bot_ix)]
    Out[434]: [5.0, 3.5, 4.5, 3.0, 6.5] # Loopy version to test out o/p

Upvotes: 3

Oliver W.

Reputation: 13459

One way to do it would be to let numpy compute the cumulative sum and then use fancy indexing into the newly generated array, like this:

    np.true_divide(arr.cumsum(axis=0)[bot_ix-1,range(arr.shape[1])], bot_ix)

I won't make any claims about speed, since this needlessly computes the cumulative sum over more elements than strictly required; whether that matters depends entirely on your particular data.
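A self-contained sketch of this approach, using the sample data from the question; row `bot_ix[j]-1` of the cumulative sum holds the sum of `arr[0:bot_ix[j], j]`:

```python
import numpy as np

arr = np.arange(20).reshape(4, 5)
bot_ix = np.array([3, 2, 2, 1, 2])

# Running column sums, then pick one row per column and divide by slice length
cs = arr.cumsum(axis=0)
out = np.true_divide(cs[bot_ix - 1, np.arange(arr.shape[1])], bot_ix)
print(out)
```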

Upvotes: 1

piRSquared

Reputation: 294358

A blend of Divakar and Oliver W.

    mask = np.arange(arr.shape[0])[:, None] < bot_ix
    (arr * mask).sum(0) / bot_ix.astype(float)

    array([ 5. ,  3.5,  4.5,  3. ,  6.5])

Upvotes: 0
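As a sanity check, the three approaches can be compared on a random problem at roughly the scale the question mentions (the shape and index range below are illustrative assumptions, not from any answer):

```python
import numpy as np

# Shapes chosen to mimic the ~(50, 50K) case from the question
rng = np.random.default_rng(0)
arr = rng.random((50, 50_000))
bot_ix = rng.integers(1, arr.shape[0] + 1, size=arr.shape[1])

mask = np.arange(arr.shape[0])[:, None] < bot_ix

# Divakar: masked einsum sum divided by per-column counts
einsum_out = np.true_divide(np.einsum('ij,ij->j', arr, mask), mask.sum(0))
# Oliver W.: cumulative sum plus fancy indexing
cumsum_out = np.true_divide(
    arr.cumsum(axis=0)[bot_ix - 1, np.arange(arr.shape[1])], bot_ix)
# piRSquared: masked multiply-and-sum
blend_out = (arr * mask).sum(0) / bot_ix.astype(float)

assert np.allclose(einsum_out, cumsum_out)
assert np.allclose(einsum_out, blend_out)
```

All three agree; which is fastest will depend on array shape and memory layout, so it's worth timing them on your actual data.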
