Reputation:
I have a 2-D float array and want to count the number of fields greater than a threshold in each column, storing the counts in a 1-D array. Currently I am using the following code, but it takes a lot of time (array size: 27000 by 27000). Can anyone tell me a faster way to do this?
Following is my code:
for Column in range(len(CorrelationData)):
    BestMatchCount[0][Column] = sum(i >= Threshold for i in CorrelationData[:][Column])
Upvotes: 0
Views: 2848
Reputation: 124
The best, though probably not the easiest, way to increase the performance of this would be to follow the divide-and-conquer methodology: create a child thread to iterate through each column, and have the thread do the needed calculations. Then, once all threads have finished and returned their values, combine the values to find your result.
EDIT: added some sample code. The variable data stands in for the 2-D array from the OP's question (the original name 2DArray is not a legal Python identifier), and threshold stands in for the OP's threshold value.

import threading

threshold = 0.5  # stand-in for the OP's Threshold
threadLock = threading.Lock()
counter = 0

class Worker(threading.Thread):
    def __init__(self, threadID, name, column):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
        self.column = column
        self.start()

    def run(self):
        # Do the counting outside the lock so threads don't serialize on it.
        result = self.doWork(self.column)
        global counter
        threadLock.acquire()
        # Increment the shared threshold counter here.
        counter += result
        threadLock.release()

    def doWork(self, column):
        count = 0
        for value in column:
            # Test if the number meets the threshold.
            if value >= threshold:
                count += 1
        return count

threads = []
tid = 0
for column in data:
    threads.append(Worker(tid, 'thread-{0}'.format(tid), column))
    tid += 1
for thread in threads:
    thread.join()
Upvotes: 1
Reputation: 250921
You should use pure NumPy for this, Python's for-loops will slow it down:
>>> arr = np.random.rand(1000, 1000)
>>> %timeit [sum(i >= 0.5 for i in arr.T[c]) for c in xrange(len(arr))]
1 loops, best of 3: 1.58 s per loop
>>> %timeit np.sum(arr >= 0.5, axis=0)
1000 loops, best of 3: 1.53 ms per loop
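Applied to the question's own variable names (assuming CorrelationData is already a NumPy array; the small array below is just an illustrative stand-in for the 27000-by-27000 one), the whole loop collapses to a single vectorized line:

```python
import numpy as np

# Illustrative stand-in for the question's large array.
CorrelationData = np.array([[0.1, 0.9],
                            [0.6, 0.2],
                            [0.7, 0.4]])
Threshold = 0.5

# Compare every element at once, then sum the booleans down each column.
BestMatchCount = np.sum(CorrelationData >= Threshold, axis=0)
print(BestMatchCount)  # [2 1]
```

The comparison produces a boolean array, and summing along axis=0 counts the True entries per column, giving one count per column as the question asks.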
Upvotes: 6
Reputation: 267
First sort your array:

a = [5, 2, 3, 1, 4]
a.sort()

Then iterate with an if test; once you hit the threshold value you can stop searching. This might speed up your search a little bit, since sorting is fast in Python.

To reverse your sorted list you can use the following (note that the cmp argument is Python 2 only; in Python 3 use sorted(a, reverse=True) instead):

def reverse_numeric(x, y):
    return y - x

sorted([5, 2, 4, 1, 3], cmp=reverse_numeric)
[5, 4, 3, 2, 1]

See https://wiki.python.org/moin/HowTo/Sorting for more information.
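The early-stop idea above can be taken further with a binary search: on a sorted column you do not need to scan at all. A minimal sketch using the standard library's bisect module (count_at_least is an illustrative name, not from the question):

```python
from bisect import bisect_left

def count_at_least(column, threshold):
    # Sort ascending, then binary-search for the first value >= threshold;
    # everything from that index onward meets the threshold.
    ordered = sorted(column)
    cut = bisect_left(ordered, threshold)
    return len(ordered) - cut

print(count_at_least([5, 2, 3, 1, 4], 3))  # 3
```

Note the O(n log n) sort dominates the O(log n) search, so this only pays off when the same column is queried against many different thresholds.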
Upvotes: 0
Reputation: 60
Unfortunately, summation over the columns is pretty strictly O(n^2) unless you cheat using vectorized processing or some other parallel method. The implicit vectorization of R might prove an easy solution if you're flexible about language. If not, then I think parallelization, with some number of threads taking successive columns as they finish, might be the fastest way to go (as Andrew suggested before me).
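A minimal sketch of that columns-as-they-finish idea using the standard library's ThreadPoolExecutor (the names, threshold value, and toy data below are illustrative, not from the question):

```python
from concurrent.futures import ThreadPoolExecutor

THRESHOLD = 0.5  # illustrative threshold

def count_above(column):
    # Count the entries of one column meeting the threshold.
    return sum(1 for value in column if value >= THRESHOLD)

# Each inner list stands in for one column of the real array.
columns = [[0.1, 0.9, 0.6], [0.2, 0.3, 0.8]]
with ThreadPoolExecutor(max_workers=4) as pool:
    # map hands successive columns to whichever worker thread is free.
    counts = list(pool.map(count_above, columns))
print(counts)  # [2, 1]
```

One caveat: in CPython the GIL keeps pure-Python threads from running this arithmetic truly in parallel, so a real speedup would need the per-column work to release the GIL (e.g. NumPy) or a process pool instead of threads.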
Upvotes: 0