JMD
JMD

Reputation: 31

Rotating Large Arrays, Fastest Possible

I am relatively new to Python and looking for best optimized code for rotating large multi-dimensional arrays. In the following code I have a 16X600000 32bit floating point multi-dimensional array and according to the timer it takes about 30ms to rotate the contents on my quad core acer windows 8 tablet. I was considering using some Cython routines or something similar if it would be possible to reduce the time required to rotate the array.

Eventually the code will be used to store y-axis values for a high speed data plotting graph based around the VisPy package and the 32bit float array will be passed to an OpenGL routine. I would like to achieve less than 1ms if possible.

Any comments, recommendations or sample code would be much appreciated.

import sys, timeit
from threading import Thread
from PyQt4 import  QtGui
import numpy as np

m = 16              # Number of signals.
n = 600000          # Number of samples per signal.
y = 0.0 * np.random.randn(m, n).astype(np.float32) 
Running = False

class Window(QtGui.QWidget):
    def __init__(self):
        QtGui.QWidget.__init__(self)
        self.button = QtGui.QPushButton('Start', self)
        self.button.clicked.connect(self.handleButton)
        layout = QtGui.QVBoxLayout(self)
        layout.addWidget(self.button)

    def handleButton(self):
        global Running, thread, thrTest
        if Running == True:
            Running = False
            self.button.setText('Start')
            thrTest.isRunning = False
            print ('stop')
        else:
            Running = True
            self.button.setText('Stop')
            thrTest = testThread()
            thread = Thread(target=thrTest.run, daemon=True )
            thread.start()
            print ("Start")   

class testThread(Thread):
    def __init__(self):
        self.isRunning = True
    def run(self):
        print('Test: Thread Started')
        while self.isRunning == True:
            start_time = timeit.default_timer()
            y[:, :-1] = y[:, 1:]
            elapsed = timeit.default_timer() - start_time
            print ('Time (s)= ' + str(elapsed))
        print('Test: Closed Thread')


if __name__ == '__main__':
    app = QtGui.QApplication(sys.argv)
    window = Window()
    window.show()
    sys.exit(app.exec_())

Update

I guess there has been some confusion about exactly what I am trying to do so I will try to explain a little better.

The ultimate goal is to have a fast real-time data logging device which draws line on a graph representing the signal value. There will be multiple channels and a sampling rate of at least 1ms and as much recording time as possible. I have started with this VisPy example. The code in the example which writes the new data into the arrays and sends it to OpenGL is in the On_Timer function near the bottom. I have modified this code slightly to integrate the OpenGL canvas into a Qt gui and added some code to get data from an Arduino Mega through an ethernet socket.

Currently I can produce a real time graph of 16 lines with a sampling rate right about 1ms and a frame rate of around 30Hz with a recording time of about 14 seconds. If I try to increase the channel count or the recording length any more the program stops working as it cannot keep up with the flow of data coming in through the Ethernet port at 1ms.

The biggest culprit I can find for this is the time it takes to complete the data buffer shift using the y[:, :-1] = y[:, 1:] routine. Originally I submitted benchmark code where this function was being timed in the hope that someone knew of a way to do the same thing in a more efficient manner. The purpose of this line is to shift the entire array one index to the left, and then in my very next line of code I write new data to the first slot on the right.

Below you can see my modified graph update routine. First it takes the new data from the queue and unpacks into a temporary array, then it shifts the contents of the main buffer array, and finally it copies the new data into the last slot of the main array. Once the queue is empty it calls the update function so that OpenGL updates the display.

def on_timer(self, event):
    """Add some data at the end of each signal (real-time signals)."""
    k=1
    s = struct.Struct('>16H')
    AdrArray =  0.0 * np.random.randn(16,1).astype(np.float32)
    if not q.qsize() == 0:
        while q.qsize() > 0:
            print (q.qsize())
            print ('iin ' + str(datetime.datetime.now()))
            AdrArray[:,0]= s.unpack_from(q.get(), offset=4)
            y[:, :-1] = y[:, 1:]
            y[:, -1:] = .002*AdrArray 
            print ('out ' + str(datetime.datetime.now()))
        self.program['a_position'].set_data(y.ravel().astype(np.float32))
        self.update()

Upvotes: 2

Views: 897

Answers (3)

Nicolas Rougier
Nicolas Rougier

Reputation: 702

The realtime-signals.py example in glumpy implements a ring buffer and might help you:

https://github.com/glumpy/glumpy/blob/master/examples/realtime-signals.py

It is easily adaptable (and will be soon adapted) for vispy as well. The trick is to use a Fortran-like layout such that updating the 16 signals result in uploading a contiguous block of 16 floats to GPU. The example should handle yours 16 x 600,000 data if it fits into GPU memory (only tested 16 x 100,000 at 250 fps).

Note that because you have a limited horizontal screen resolution, this might prevent you from seeing the whole data (in the end only 1600 data will be displayed for each signal if your screen is 1600 pixels wide).

Upvotes: 1

Almar
Almar

Reputation: 5066

As some others have mentioned, I don't think you want to shift the array on the CPU. Moving all the elements in the array on each update will always be slow. I don't expect Cython to help here either, since Numpy will be quite optimal already.

Instead, you want to take care of the shifting in the vertex shader. An example of what I think is similar to what you want is here: http://vispy.org/examples/demo/gloo/realtime_signals.html

Edit: One approach is to consider your VBO a circular buffer. You add new values at the "oldest" location, and in the vertex shader you use an offset uniform to map the signals to the correct position. A similar technique is used in http://vispy.org/examples/demo/gloo/spacy.html

Upvotes: 1

hpaulj
hpaulj

Reputation: 231425

Do you really want this 'roll'? It's shifting the values left, filling with the last column

In [179]: y = np.arange(15).reshape(3,5)
In [180]: y[:,:-1]=y[:,1:]
In [181]: y
Out[181]: 
array([[ 1,  2,  3,  4,  4],
       [ 6,  7,  8,  9,  9],
       [11, 12, 13, 14, 14]])

For nonstandard 'roll' like this it is unlikely that there's anything faster.

np.roll has a different fill on the left

In [190]: np.roll(y,-1,1)
Out[190]: 
array([[ 1,  2,  3,  4,  0],
       [ 6,  7,  8,  9,  5],
       [11, 12, 13, 14, 10]])

For what it's worth, the core of roll is:

indexes = concatenate((arange(n - shift, n), arange(n - shift)))
res = a.take(indexes, axis)

Your particular 'roll' could be reproduced with a similar 'take'

In [204]: indexes=np.concatenate([np.arange(1,y.shape[1]),[y.shape[1]-1]])
In [205]: y.take(indexes,1)

Your y[:,:-1]... is faster than roll because it does not create a new array; instead it just overwrites part of the existing one.

take accepts an out parameter, so this is possible:

y.take(indexes,1,y)

though speed wise this only helps with small arrays. For large ones your overwriting assignment is faster.

I'd also suggest looking at using the transpose, and rolling on the axis=0. For an order='C' array, the values of a row form a contiguous block.

The big time consumer is that you have to copy (nearly) all of the array from one location to another, either in a new array, or onto itself. If the data were in some sort of ring buffer, you could just change a pointer, without having to copy any data.

Upvotes: 1

Related Questions