Merlin
Merlin

Reputation: 25629

Create Numpy array without enumerating array

Starting with this:

x = range(30,60,2)[::-1];
x = np.asarray(x); x

array([58, 56, 54, 52, 50, 48, 46, 44, 42, 40, 38, 36, 34, 32, 30])

Create an array like this: (Notice, first item repeats) But if I can get this faster without the first item repeating, I can np.hstack first item.

[[58 58 56 54 52]
 [56 56 54 52 50]
 [54 54 52 50 48]
 [52 52 50 48 46]
 [50 50 48 46 44]
 [48 48 46 44 42]
 [46 46 44 42 40]
 [44 44 42 40 38]
 [42 42 40 38 36]
 [40 40 38 36 34]
 [38 38 36 34 32]
 [36 36 34 32 30]
 [34 34 32 30 None]
 [32 32 30 None None]
 [30 30 None None None]]

The code below works, want it faster without 'for' loop and enumerate.

arr = np.empty((0,5), int)

for i,e in enumerate(x):
    arr2 = np.hstack((x[i], x[i:i+4], np.asarray([None]*5)))[:5]
    arr  = np.vstack((arr,arr2))

Upvotes: 1

Views: 728

Answers (4)

Divakar
Divakar

Reputation: 221514

Approach #1

Here's a vectorized approach using NumPy broadcasting -

N = 4 # width factor
x_ext = np.concatenate((x,[None]*(N-1)))
arr2D = x_ext[np.arange(N) + np.arange(x_ext.size-N+1)[:,None]]
out = np.column_stack((x,arr2D))

Approach #2

Here's another one using hankel -

from scipy.linalg import hankel

N = 4 # width factor
x_ext = np.concatenate((x,[None]*(N-1)))
out = np.column_stack((x,hankel(x_ext[:4], x_ext[3:]).T))

Runtime test

Here's a modified version of @Aaron's benchmarking script using an input format for this post identical to the one used for his post in that script for a fair benchmarking and focusing just on these two approaches -

upper_limit = 58 # We will edit this to vary the dataset sizes

print "Timings are : "
t = time()
for _ in range(1000):  #1000 iterations of @Aaron's soln.
    width = 3
    x = np.array(range(upper_limit,28,-2) + [float('nan')]*width)
    arr = np.empty([len(x)-width, width+2])
    arr[:,0] = x[:len(x)-width]
    for i in xrange(len(x)-width): 
        arr[i,1:] = x[i:i+width+1]
print(time()-t)

t = time()
for _ in range(1000): 
    N = 4 # width factor
    x_ext = np.array(range(upper_limit,28,-2) + [float('nan')]*(N-1))
    arr2D = x_ext[np.arange(N) + np.arange(x_ext.size-N+1)[:,None]]
    out = np.column_stack((x_ext[:len(x_ext)-N+1],arr2D))
print(time()-t)

Case #1 (upper_limit = 58 ) :

Timings are : 
0.0316879749298
0.0322730541229

Case #2 (upper_limit = 1058 ) :

Timings are : 
0.680443048477
0.124517917633

Case #3 (upper_limit = 5058 ) :

Timings are : 
3.28129291534
0.47504901886

Upvotes: 5

hpaulj
hpaulj

Reputation: 231335

Starting with Divaker's padded x

N = 4 # width factor
x_ext = np.concatenate((x,[None]*(N-1)))

Since we aren't doing math on it, padding with None (which makes an object array) or np.nan (which makes a float) shouldn't make much difference.

The column stack could be eliminated with a little change to the indexing:

idx = np.r_[0,np.arange(N)] + np.arange(x_ext.size-N+1)[:,None]

this produces

array([[ 0,  0,  1,  2,  3],
       [ 1,  1,  2,  3,  4],
       [ 2,  2,  3,  4,  5],
       [ 3,  3,  4,  5,  6],
       [ 4,  4,  5,  6,  7],
       ...

so the full result is

x_ext[idx]

================

A different approach is to use striding to create a kind of rolling window.

as_strided = np.lib.stride_tricks.as_strided
arr2D = as_strided(x_ext, shape=(15,4), str‌​ides=(4,4))

This is one of easier applications of as_strided. shape is straight forward - the shape of the desired result (without the repeat column) (x.shape[0],N).

In [177]: x_ext.strides
Out[177]: (4,)

For 1d array of this type, the step to the next item is 4 bytes. If I reshape the array to 2d with 3 columns, the stride to the next row is 12 - 3*4 (3 offset).

In [181]: x_ext.reshape(6,3).strides
Out[181]: (12, 4)

Using strides=(4,4) means that the step to next row is just 4 bytes, one element in original.

as_strided(x_ext,shape=(8,4),strides=(8,4))

produces a 2 item overlap

array([[58, 56, 54, 52],
       [54, 52, 50, 48],
       [50, 48, 46, 44],
       [46, 44, 42, 40],
       ....

The potentially dangerous part of as_strided is that it is possible to create an array that samples memory outside of the original data buffer. Usually that appears as large random numbers where None appears in this example. It's the same sort of error that you would encounter if C code if you were careless in using array pointers and indexing.

The as_strided array is a view (the repeated values are not copied). So writing to that array could be dangerous. The column_stack with x will make a copy, replicating the repeated values as needed.

Upvotes: 3

josoler
josoler

Reputation: 1423

I suggest to contruct an initial matrix with equal columns and then use np.roll() to rotate them:

import numpy as np
import timeit as ti
import numpy.matlib

x = range(30,60,2)[::-1];
x = np.asarray(x);

def sol1():
    # Your solution, for comparison
    arr = np.empty((0,5), int)

    for i,e in enumerate(x):
        arr2 = np.hstack((x[i], x[i:i+4], np.asarray([None]*5)))[:5]
        arr  = np.vstack((arr,arr2))
    return arr

def sol2():
    # My proposal
    x2 = np.hstack((x, [None]*3))
    mat = np.matlib.repmat(x2, 5, 1)
    for i in range(3):
        mat[i+2, :] = np.roll(mat[i+2, :], -(i+1))
    return mat[:,:-3].T


print(ti.timeit(sol1, number=100))
print(ti.timeit(sol2, number=100))

which guives:

0.026760146000015084
0.0038611710006080102

It uses a for loop but it only iterates over the shorter axis. Also, it should not be hard to adapt this code for other configurations instead of using hardcoded numbers.

Upvotes: 2

Aaron
Aaron

Reputation: 11075

I got about an order of magnitude faster by avoiding _stack() and only using floats...

edit: added @Divakar's post to time trial...

import numpy as np
from time import time

t = time()
for _ in range(1000):  #1000 iterations of my soln.
    width = 3
    x = np.array(range(58,28,-2) + [float('nan')]*width)
    arr = np.empty([len(x)-width, width+2])
    arr[:,0] = x[:len(x)-width]
    for i in xrange(len(x)-width): 
        arr[i,1:] = x[i:i+width+1]
print(time()-t)

t = time()
for _ in range(1000):  #1000 iterations of OP code
    x = range(30,60,2)[::-1];
    x = np.asarray(x)
    arr = np.empty((0,5), int)
    for i,e in enumerate(x):
        arr2 = np.hstack((x[i], x[i:i+4], np.asarray([None]*5)))[:5]
        arr  = np.vstack((arr,arr2))
print(time()-t)

t = time()
for _ in range(1000): 
    x = np.array(range(58,28,-2))
    N = 4 # width factor
    x_ext = np.hstack((x,[None]*(N-1)))
    arr2D = x_ext[np.arange(N) + np.arange(x_ext.size-N+1)[:,None]]
    out = np.column_stack((x,arr2D))
print(time()-t)

prints out:

>>> runfile('...temp.py', wdir='...')
0.0160000324249
0.374000072479
0.0319998264313
>>> 

Upvotes: 3

Related Questions