anon
anon

Reputation:

Indexing ranges of columns of array when only the indexes of the ranges are given

I am looking for an efficient way of indexing the columns of a numpy array with several ranges, when only the indexes of the desired ranges are given.

For example, given the following array, and a range size r_size=3:

import numpy as np
arr = np.arange(18).reshape((2,9))

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8],
       [ 9, 10, 11, 12, 13, 14, 15, 16, 17]])

This would mean that there are a total of 3 sets of ranges [r0, r1, r2] whose elements in the array are distributed as:

[[r0_00, r0_01, r0_02, r1_00, r1_01, r1_02, r2_00, r2_01, r2_02]
 [r0_10, r0_11, r0_12, r1_10, r1_11, r1_12, r2_10, r2_11, r2_12]]

So if I want to access the ranges r0 and r2, then I would obtain:

arr    = np.arange(18).reshape((2,9))
r_size = 3
ranges = [0, 2]
# --------------------------------------------------------
# Line that index arr, with the variable ranges... Output:
# --------------------------------------------------------
array([[ 0,  1,  2,  6,  7,  8],
       [ 9, 10, 11, 15, 16, 17]])

The fastest way that I've found is the following:

import numpy as np
from itertools import chain

arr    = np.arange(18).reshape((2,9))
r_size = 3
ranges = [0,2]

arr[:, list(chain(*[range(r_size*x,r_size*x+r_size) for x in ranges]))]

array([[ 0,  1,  2,  6,  7,  8],
       [ 9, 10, 11, 15, 16, 17]])

But I am not sure if it can be improved in terms of speed.

Thanks in advance!

Upvotes: 1

Views: 109

Answers (2)

Kevin
Kevin

Reputation: 3348

You will inevitably need to copy the data to get the desired result in a contiguous array. Although to make it efficient I would suggest trying to minimize the number of times you copy the data. Any kind of reshaping operation can be expressed with np.lib.stride_tricks.as_strided.

Assume the original array contains 64-bit integers, then each element is 8 bytes arranged in some shape:

import numpy as np
arr = np.arange(18).reshape((2,9))
arr.shape, arr.strides

output:

((2, 9), (72, 8))

so each column skips 8 bytes and each row skips 72 bytes. arr.reshape(len(arr), -1, r_size) can be expressed as:

np.lib.stride_tricks.as_strided(arr, (2,3,3), (72,24,8))

output:

array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]]])

And arr.reshape(len(arr), -1, r_size)[:, ranges] can be expressed as:

np.lib.stride_tricks.as_strided(arr, (2,2,3), (72,24*2,8))

Output:

array([[[ 0,  1,  2],
        [ 6,  7,  8]],

       [[ 9, 10, 11],
        [15, 16, 17]]])

So far, we have only changes the metadata of the array which means that no data has been copied. This operation has a near-zero performance cost. But to get the final array you will need to copy the data somehow:

np.lib.stride_tricks.as_strided(arr, (2,2,3), (72,24*2,8)).reshape(len(arr), -1)

Output:

array([[ 0,  1,  2,  6,  7,  8],
       [ 9, 10, 11, 15, 16, 17]])

This is not a generalized solution, but it might give you some ideas nonetheless on how to optimize.

Unfortunately, my timings do not back these claims but it is intuitive still and worth testing for some larger arrays.

Upvotes: 1

Ivan
Ivan

Reputation: 40618

You could start by splitting the array up in r_size chunks:

>>> splits = np.split(arr, r_size, axis=1)
[array([[ 0,  1,  2],
        [ 9, 10, 11]]), 
 array([[ 3,  4,  5],
        [12, 13, 14]]), 
 array([[ 6,  7,  8],
        [15, 16, 17]])]

Stack with np.stack and select the correct ranges:

>>> stack = np.stack(splits)[ranges]
array([[[ 0,  1,  2],
        [ 9, 10, 11]],

       [[ 6,  7,  8],
        [15, 16, 17]]])

And concatenate horizontally with np.hstack or np.concantenate on axis=1:

>>> np.stack(stack)
array([[ 0,  1,  2,  6,  7,  8],
       [ 9, 10, 11, 15, 16, 17]])

Overall this looks like:

>>> np.hstack(np.stack(np.split(arr, r_size, axis=1))[ranges])
array([[ 0,  1,  2,  6,  7,  8],
       [ 9, 10, 11, 15, 16, 17]])

Alternatively, you can work with np.reshapes exclusively which will be faster:

Initial reshape:

>>> arr.reshape(len(arr), -1, r_size)
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]]])

Indexing with ranges:

>>> arr.reshape(len(arr), -1, r_size)[:, ranges]
array([[[ 0,  1,  2],
        [ 6,  7,  8]],

       [[ 9, 10, 11],
        [15, 16, 17]]])

Then, reshaping back to the final form:

>>> arr.reshape(len(arr),  -1, r_size)[:, ranges].reshape(len(arr), -1)

Upvotes: 1

Related Questions