aveek

Reputation: 180

iterating through a huge loop efficiently using python

I have 100000 images and I need to get a vector for each image:

import cPickle

imageVectors = []
for i in range(100000):
    fileName = "Images/" + str(i) + '.jpg'
    imageVectors.append(getvector(fileName).reshape((1, 2048)))
cPickle.dump(imageVectors, open('imageVectors.pkl', 'wb'), cPickle.HIGHEST_PROTOCOL)

getvector is a function that takes one image at a time and needs about 1 second to process it. So, basically, my problem reduces to:

for i in range(100000):
    A = callFunction(i)  # a complex function that takes 1 sec per call

The things that I have already tried are (only pseudo-code is given here):

1) Using numpy vectorize:

import numpy as np

def callFunction1(i):
    return callFunction2(i)

vfunc = np.vectorize(callFunction1)
imageVectors = vfunc(list(range(100000)))

2) Using python map:

def callFunction1(i):
    return callFunction2(i)

imageVectors = map(callFunction1, list(range(100000)))

3) Using python multiprocessing:

import multiprocessing

try:
    cpus = multiprocessing.cpu_count()
except NotImplementedError:
    cpus = 4   # arbitrary default

pool = multiprocessing.Pool(processes=cpus)
result = pool.map(callFunction, range(100000))

4) Using multiprocessing in a different way:

from multiprocessing import Process, Queue
q = Queue()
N = 100000
# each process handles one quarter of the indices and puts its results on q
p1 = Process(target=callFunction, args=(0, N // 4, q))
p1.start()
p2 = Process(target=callFunction, args=(N // 4, N // 2, q))
p2.start()
p3 = Process(target=callFunction, args=(N // 2, 3 * N // 4, q))
p3.start()
p4 = Process(target=callFunction, args=(3 * N // 4, N, q))
p4.start()

results = []
for i in range(4):
    results.append(q.get(True))
p1.join()
p2.join()
p3.join()
p4.join()

All of the above methods take an immensely long time. Is there a more efficient way, for example one that processes many elements simultaneously instead of sequentially, or any other way to speed this up?


The time is mainly being taken by the getvector function itself. As a workaround, I have split my data into 8 batches and am running the same program on a different part of the loop in eight separate Python instances on an octa-core VM in Google Cloud. Could anyone suggest whether map-reduce, or using GPUs with PyCUDA, would be a good option?

Upvotes: 3

Views: 191

Answers (1)

Roland Smith

Reputation: 43505

The multiprocessing.Pool solution is a good one, in the sense that it uses all your cores. So it should be approximately N times faster than using plain old map, where N is the number of cores you have.

BTW, you can skip determining the number of cores. By default multiprocessing.Pool uses as many processes as your CPU has cores.

Instead of a plain map (which blocks until everything has been processed), I would suggest using imap_unordered. This is an iterator that starts returning results as soon as they become available, so your parent process can start further processing right away. If ordering is important, you might want to return a tuple of (number, array) to identify each result, as in the sketch below.
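
A minimal sketch of that approach (Python 3 shown; getvector is the question's own function, and the chunksize value is just an illustrative guess):

import multiprocessing as mp

# getvector is assumed to be the question's own feature-extraction function.

def worker(i):
    # Return the index together with the vector, so results can be put back
    # in order even though imap_unordered yields them as they finish.
    fileName = "Images/" + str(i) + '.jpg'
    return i, getvector(fileName).reshape((1, 2048))

if __name__ == '__main__':
    imageVectors = [None] * 100000
    # Pool() without arguments starts one worker per CPU core.
    with mp.Pool() as pool:
        for i, vec in pool.imap_unordered(worker, range(100000), chunksize=64):
            imageVectors[i] = vec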

Your function returns a numpy array of 2048 values, which I assume are numpy.float64. Using the standard mapping functions will transport the results back to the parent process using IPC. On a 4-core machine that means roughly 4 IPC transfers per second of 2048*8 = 16384 bytes each, so about 65536 bytes/second. That doesn't sound too bad. But I don't know how much overhead the IPC (which involves pickling and Queues) will incur.

In case the overhead is large, you might want to create a shared memory area to store the results in. You would need approximately 1.5 GiB to store 100000 results of 2048 8-byte floats. That is a sizeable amount of memory, but not impractical for current machines; see the sketch below.
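
A rough sketch of that idea, using a shared ctypes buffer that every worker writes into directly (again assuming the question's getvector; the buffer layout and chunksize are illustrative):

import ctypes
import multiprocessing as mp
import numpy as np

N, DIM = 100000, 2048

# One flat shared buffer for all results: 100000 * 2048 float64, about 1.5 GiB.
shared = mp.RawArray(ctypes.c_double, N * DIM)

def init(buf):
    # Expose the shared buffer inside each worker as a numpy view.
    global results
    results = np.frombuffer(buf, dtype=np.float64).reshape(N, DIM)

def worker(i):
    fileName = "Images/" + str(i) + '.jpg'
    results[i, :] = getvector(fileName).reshape(-1)

if __name__ == '__main__':
    with mp.Pool(initializer=init, initargs=(shared,)) as pool:
        pool.map(worker, range(N), chunksize=64)
    # A view for the parent process; no per-result IPC was needed.
    results = np.frombuffer(shared, dtype=np.float64).reshape(N, DIM)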

For 100000 images, 4 cores, and about one second per image, your program's running time would be on the order of 100000 s / 4 ≈ 7 hours.

Your most important optimization task would be to look into reducing the runtime of the getvector function. For example, would it work just as well if you halved the dimensions of the images? Assuming that the runtime scales linearly with the number of pixels, halving both dimensions quarters the pixel count and should cut the runtime to about 0.25 s; see the sketch below.
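
A tiny sketch of pre-shrinking the images with Pillow; whether getvector accepts the smaller files (and still produces useful vectors) is an assumption you would have to verify:

from PIL import Image

def shrink(i):
    # Halve both dimensions, i.e. quarter the pixel count.
    fileName = "Images/" + str(i) + '.jpg'
    img = Image.open(fileName)
    img = img.resize((img.width // 2, img.height // 2))
    img.save("Images/small_" + str(i) + '.jpg')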

Upvotes: 2
