Reputation: 6869
Let's say I have a NumPy array:
x = np.array([3, 9, 2, 1, 5, 4, 7, 7, 8, 6])
If I sum up this array, I get 52
. What I need is a way to split it up starting from left to right into roughly n
chunks where n
is chosen by the user. Essentially, the splits occur in a greedy fashion. So, for some number of chunks n
, the first n - 1
chunks must each sum up to at least 52/n
and they must be consecutive indices taken from left to right.
So, if n = 2
then the first chunk would consist of the first 7 elements:
chunk[0] = x[:7] # [3, 9, 2, 1, 5, 4, 7], sum = 31
chunk[1] = x[7:] # [7, 8, 6], sum = 21
Notice that the first chunk wouldn't consist of the first 6 elements only since the sum would be 24
which is less than 52/2 = 26
. Also, notice that the number of elements in each chunk is allowed to vary as long as the sum criteria is met. Finally, it is perfectly fine for the last chunk to not be close to 52/2 = 26
since the other chunk(s) may take more.
However, the output that I need is a two column array that contains the start index in the first column and the (exclusive) stop index in the second column:
[[0, 7],
[7, 10]]
If n = 4
, then the first 3 chunks need to each sum up to at least 52/4 = 13
and would look like this:
chunk[0] = x[:3] # [3, 9, 2], sum = 14
chunk[1] = x[3:7] # [1, 5, 4], sum = 17
chunk[2] = x[7:9] # [7, 8], sum = 15
chunk[3] = x[9:] # [6], sum = 6
And the output that I need would be:
[[0, 3],
[3, 7],
[7, 9],
[9, 10]
So, one naive approach using for loops might be:
ranges = np.zeros((n_chunks, 2), np.int64)
ranges_idx = 0
range_start_idx = start
sum = 0
for i in range(x.shape[0]):
sum += x[i]
if sum > x.sum() / n_chunks:
ranges[ranges_idx, 0] = range_start_idx
ranges[ranges_idx, 1] = min(
i + 1, x.shape[0]
) # Exclusive stop index
# Reset and Update
range_start_idx = i + 1
ranges_idx += 1
sum = 0
# Handle final range outside of for loop
ranges[ranges_idx, 0] = range_start_idx
ranges[ranges_idx, 1] = x.shape[0]
if ranges_idx < n_chunks - 1:
left[ranges_idx:] = x.shape[0]
return ranges
I am looking for a nicer vectorized solution.
Upvotes: 2
Views: 1472
Reputation: 6869
I found inspiration in a similar question that was answered:
def func(x, n):
out = np.zeros((n, 2), np.int64)
cum_arr = x.cumsum() / x.sum()
idx = 1 + np.searchsorted(cum_arr, np.linspace(0, 1, n, endpoint=False)[1:])
out[1:, 0] = idx # Fill the first column with start indices
out[:-1, 1] = idx # Fill the second column with exclusive stop indices
out[-1, 1] = x.shape[0] # Handle the stop index for the final chunk
return out
Update
To cover the pathological case, we need to be a little more precise and do something like:
def func(x, n, truncate=False):
out = np.zeros((n_chunks, 2), np.int64)
cum_arr = x.cumsum() / x.sum()
idx = 1 + np.searchsorted(cum_arr, np.linspace(0, 1, n, endpoint=False)[1:])
out[1:, 0] = idx # Fill the first column with start indices
out[:-1, 1] = idx # Fill the second column with exclusive stop indices
out[-1, 1] = x.shape[0] # Handle the stop index for the final chunk
# Handle pathological case
diff_idx = np.diff(idx)
if np.any(diff_idx == 0):
row_truncation_idx = np.argmin(diff_idx) + 2
out[row_truncation_idx:, 0] = x.shape[0]
out[row_truncation_idx-1:, 1] = x.shape[0]
if truncate:
out = out[:row_truncation_idx]
return out
Upvotes: 3
Reputation: 998
Here is a solution that doesn't iterate over all elements:
def fun2(array, n):
min_sum = np.sum(array) / n
cumsum = np.cumsum(array)
i = -1
count = min_sum
out = []
while i < len(array)-1:
j = np.searchsorted(cumsum, count)
out.append([i+1, j+1])
i = j
if i < len(array):
count = cumsum[i] + min_sum
out[-1][1] -= 1
return np.array(out)
For the two test cases it produces the results you expected. HTH
Upvotes: 1