user3552432

Reputation:

Python: Fastest Way to Traverse 2-D Array

I have a 2-D float array and want to count the number of fields greater than a threshold in each column and store the counts in a 1-D array. Currently I am using the following code, but it takes a long time (array size: 27000 by 27000). Can anyone tell me a faster way to do this?

Following is my code:

for Column in range(len(CorrelationData)):
    BestMatchCount[0][Column] = sum(i >= Threshold for i in CorrelationData[:][Column])

Upvotes: 0

Views: 2848

Answers (4)

Andrew

Reputation: 124

The best, though probably not the easiest, way to increase the performance of this would be to follow the divide-and-conquer methodology. Create a child thread to iterate through each column, and have the thread do the needed calculations. Then, once all threads have finished and returned their values, combine them to find your result.

EDIT: added some sample code. The variable array2d represents the 2-D array from the OP's question, and Threshold is the OP's threshold value.

import threading

class Worker(threading.Thread):
    def __init__(self, threadID, name, column, threshold):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
        self.column = column
        self.threshold = threshold

        self.start()

    def run(self):
        # Do the per-column work first, outside the lock.
        result = self.doWork(self.column)

        threadLock.acquire()
        # Increment the shared threshold counter while holding the lock.
        global counter
        counter += result
        threadLock.release()

    def doWork(self, column):
        count = 0
        for value in column:
            # Test if the number meets the threshold.
            if value >= self.threshold:
                count += 1
        return count

threadLock = threading.Lock()
threads = []
counter = 0

tid = 0
# array2d and Threshold stand for the 2-D array and threshold value from the question.
for column in array2d:
    threads.append(Worker(tid, 'thread-{0}'.format(tid), column, Threshold))
    tid += 1

for thread in threads:
    thread.join()
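
After the final join(), counter holds the combined count across all of the columns. To produce the 1-D per-column result asked for in the question, each worker could instead record its doWork() return value under its own column index rather than adding everything into the single shared counter.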

Upvotes: 1

Ashwini Chaudhary

Reputation: 250921

You should use pure NumPy for this; Python's for-loops will slow it down:

>>> import numpy as np
>>> arr = np.random.rand(1000, 1000)
>>> %timeit [sum(i >= 0.5 for i in arr.T[c]) for c in xrange(len(arr))]
1 loops, best of 3: 1.58 s per loop
>>> %timeit np.sum(arr >= 0.5, axis=0)
1000 loops, best of 3: 1.53 ms per loop
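
For the question's own variables, the same idea is a one-liner; this is a sketch assuming CorrelationData is already a NumPy array and Threshold is the cutoff from the question:

>>> BestMatchCount = np.sum(CorrelationData >= Threshold, axis=0)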

Upvotes: 6

user3187811

Reputation: 267

First sort your array using the following command:

a = [5, 2, 3, 1, 4]
a.sort()

Then you can use an if statement; once you hit the threshold value you can stop searching.

This might speed up your search a little bit; sorting in Python is fast, since list.sort() is implemented in C.

To reverse your sorted list you can use the following command:

def reverse_numeric(x, y):
    return y - x

sorted([5, 2, 4, 1, 3], cmp=reverse_numeric)  # returns [5, 4, 3, 2, 1]

See https://wiki.python.org/moin/HowTo/Sorting for more information.
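
Applied per column, a minimal sketch of this idea (assuming the question's CorrelationData is a 2-D list of rows and Threshold is the cutoff; both names come from the question) could look like:

counts = []
for column in zip(*CorrelationData):        # walk the columns of the 2-D array
    ordered = sorted(column, reverse=True)  # largest values first
    count = 0
    for value in ordered:
        if value < Threshold:               # stop searching once below the threshold
            break
        count += 1
    counts.append(count)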

Upvotes: 0

Thomas of the Scotts

Reputation: 60

Unfortunately, summation over columns is pretty strictly O(n^2) unless you cheat using vectorized processing or some other parallel method. The implicit vectorization of R might prove to be an easy solution if you're flexible about language. If not, then I think parallelization with some number of threads taking successive columns as they finish might be the fastest way to go (as Andrew suggested before me).
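
A rough illustration of that thread-per-column idea (a sketch assuming the question's CorrelationData and Threshold; note that with CPython's GIL this only pays off if the per-column work releases the GIL, e.g. by calling into NumPy):

from concurrent.futures import ThreadPoolExecutor

def count_above(column):
    # Count the values in one column that meet or exceed the threshold.
    return sum(1 for value in column if value >= Threshold)

with ThreadPoolExecutor(max_workers=4) as pool:
    # Workers pick up successive columns as they finish.
    counts = list(pool.map(count_above, zip(*CorrelationData)))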

Upvotes: 0
