J_yang

Reputation: 2812

A more efficient way to resize a NumPy array into different-sized chunks

Sorry I am not sure how to put the title more accurately.

I have an array that I would like to split into 3 sub-arrays (bands); each band should then be downsampled to a different number of values by averaging over fixed-length chunks of the original array.

Here is what I have:

import numpy as np
a = np.arange(100)
bins = [5, 4, 3]                              # number of averaged output values per band
split_index = [[20, 39], [40, 59], [60, 80]]  # [start, end] index of each band in a
b = []
for count, item in enumerate(bins):
    start = split_index[count][0]
    end = split_index[count][1]
    increment = (end - start) // item
    b_per_band = []
    for i in range(item):
        each_slice = a[start + i * increment : start + (i + 1) * increment]
        b_per_band.append(each_slice.mean())
    b.append(b_per_band)
print(b)

Result:

[[21.0, 24.0, 27.0, 30.0, 33.0], [41.5, 45.5, 49.5, 53.5], [62.5, 68.5, 74.5]]

So I loop through bins, work out the increment (chunk length) for each step, slice the array accordingly and append the mean of each chunk to the result.

But this is really ugly and, more importantly, performs badly. As I am dealing with audio spectra in my case, I would really like to learn a more efficient way to achieve the same result.

Any suggestion?

Upvotes: 4

Views: 296

Answers (2)

FObersteiner

Reputation: 25544

Here's an option using np.add.reduceat:

import numpy as np

a = np.arange(100)
n_in_bin = [5, 4, 3]
split_index = [[20, 39], [40, 59], [60, 80]]
b = []
for i, sl in enumerate(split_index):
    n_bins = (sl[1] - sl[0]) // n_in_bin[i]       # samples averaged per output value
    v = a[sl[0]:sl[0] + n_in_bin[i] * n_bins]     # trim so the slice divides evenly
    sel_bins = np.linspace(0, len(v), n_in_bin[i] + 1, endpoint=True).astype(int)
    b.append(np.add.reduceat(v, sel_bins[:-1]) / np.diff(sel_bins))
print(b)
# [array([21., 24., 27., 30., 33.]), array([41.5, 45.5, 49.5, 53.5]), array([62.5, 68.5, 74.5])]

Some notes:

  • I changed the name bins to n_in_bin to clarify things a bit.
  • By using floor division, you discard some data. I don't know if that's really important, just a hint.
  • The thing that should make this code faster, at least for large array sizes and 'chunks', is the use of np.add.reduceat. In my experience, this can be more efficient than looping (see the small sketch after this list).
  • If you have NaNs in your input data, check out this Q&A.
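
To make this a bit more concrete, here is a tiny standalone sketch of what np.add.reduceat does for the first band; the slice, the start indices and the divisor are hard-coded purely for illustration:

import numpy as np

# for the first band: v is the trimmed slice, each output value averages 3 samples
v = np.arange(20, 35)                # a[20:35], length 15
starts = np.array([0, 3, 6, 9, 12])  # start index of each chunk of length 3

sums = np.add.reduceat(v, starts)    # sums of the 5 chunks: [63 72 81 90 99]
print(sums / 3)                      # chunk means: [21. 24. 27. 30. 33.]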

EDIT/REVISION

Since I'm also working on binning at the moment, I tried a couple of things and ran timeit for the three methods shown so far: 'looped' for the one in the question, 'npredat' for the one using np.add.reduceat, and 'npsplit' for the one using np.split. For 100000 iterations, the average time per iteration in [µs] was:

a = np.arange(10000)
bins = [5, 4, 3]
split_index = [[20, 3900], [40, 5900], [60, 8000]]
-->
looped: 127.3, npredat: 116.9, npsplit: 135.5

vs.

a = np.arange(100)
bins = [5, 4, 3]
split_index = [[20, 39], [40, 59], [60, 80]]
-->
looped: 95.2, npredat: 103.5, npsplit: 100.5

However, results were slightly inconsistent across multiple runs of the 100k iterations and might differ on machines other than the one I tried this on. So my conclusion so far would be that the differences are marginal; all 3 options fall within the same range, somewhere between 1 µs and 1 ms per iteration.
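
For reference, a minimal self-contained sketch of how such a timing can be reproduced for one variant (wrapped here as a function npredat; the name just mirrors the label above, and absolute numbers will of course depend on the machine):

import numpy as np
import timeit

a = np.arange(10000)
n_in_bin = [5, 4, 3]
split_index = [[20, 3900], [40, 5900], [60, 8000]]

def npredat():
    # same logic as the np.add.reduceat version above
    b = []
    for i, sl in enumerate(split_index):
        n_bins = (sl[1] - sl[0]) // n_in_bin[i]
        v = a[sl[0]:sl[0] + n_in_bin[i] * n_bins]
        sel_bins = np.linspace(0, len(v), n_in_bin[i] + 1, endpoint=True).astype(int)
        b.append(np.add.reduceat(v, sel_bins[:-1]) / np.diff(sel_bins))
    return b

n = 100000
total = timeit.timeit(npredat, number=n)
print(f"npredat: {total / n * 1e6:.1f} µs per iteration")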

Upvotes: 2

Ardweaden

Reputation: 887

What you're doing looks very odd to me, including the setup, which could probably use a different approach that would make the problem much simpler.

However, using the same approach, you could try this:

# a, bins and split_index as defined in the question
b = []

for count, item in enumerate(bins):
    start = split_index[count][0]
    end = split_index[count][1]
    increment = (end - start) // item

    # split the trimmed slice into `item` equal chunks and average each
    b_per_band = np.mean(np.split(a[start:start + item * increment], item), axis=1)

    b.append(b_per_band)
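
Note that np.split(arr, k) with an integer k requires the array length to divide evenly by k, which is why the slice is trimmed to start + item * increment first. A small standalone illustration (the numbers happen to match the first band):

import numpy as np

x = np.arange(20, 35)           # length 15, splits into 5 equal parts
chunks = np.split(x, 5)         # five arrays of length 3
print(np.mean(chunks, axis=1))  # [21. 24. 27. 30. 33.]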

Upvotes: 0
