aveek

Reputation: 180

iterating through a huge loop efficiently using python

I have 100000 images and I need to get a vector for each image:

import cPickle

imageVectors = []
for i in range(100000):
    fileName = "Images/" + str(i) + '.jpg'
    imageVectors.append(getvector(fileName).reshape((1, 2048)))
cPickle.dump(imageVectors, open('imageVectors.pkl', 'wb'), cPickle.HIGHEST_PROTOCOL)

getvector is a function that takes one image at a time and needs about 1 second to process it. So, basically, my problem reduces to:

for i in range(100000):
    A = callFunction(i)  # a complex function that takes 1 sec per call

The things that I have already tried are (only pseudo-code is given here):

1) Using numpy vectorize:

import numpy as np

def callFunction1(i):
    return callFunction2(i)

vfunc = np.vectorize(callFunction1)
imageVectors = vfunc(list(range(100000)))

2) Using python map:

def callFunction1(i):
    return callFunction2(i)

imageVectors = map(callFunction1, list(range(100000)))

3) Using python multiprocessing:

import multiprocessing

try:
    cpus = multiprocessing.cpu_count()
except NotImplementedError:
    cpus = 4   # arbitrary default

pool = multiprocessing.Pool(processes=cpus)
result = pool.map(callFunction, range(100000))

4) Using multiprocessing in a different way:

from multiprocessing import Process, Queue
q = Queue()
N = 100000
# each process handles one quarter of the indices and puts its results on q
p1 = Process(target=callFunction, args=(0, N // 4, q))
p1.start()
p2 = Process(target=callFunction, args=(N // 4, N // 2, q))
p2.start()
p3 = Process(target=callFunction, args=(N // 2, 3 * N // 4, q))
p3.start()
p4 = Process(target=callFunction, args=(3 * N // 4, N, q))
p4.start()

results = []
for i in range(4):
    results.append(q.get(True))
p1.join()
p2.join()
p3.join()
p4.join()

All of the above methods take an immensely long time. Is there a more efficient way, for example one that processes many elements simultaneously instead of sequentially, or any other way to speed this up?


The time is mainly being taken by the getvector function itself. As a workaround, I have split my data into 8 batches and am running the same program on a different part of the loop in eight separate Python instances on an octa-core VM in Google Cloud. Could anyone suggest whether map-reduce, or using GPUs with PyCUDA, would be a good option?

Upvotes: 3

Views: 191

Answers (1)

Roland Smith

Reputation: 43505

The multiprocessing.Pool solution is a good one, in the sense that it uses all your cores. So it should be approximately N times faster than using plain old map, where N is the number of cores you have.

BTW, you can skip determining the number of cores. By default multiprocessing.Pool uses as many processes as your CPU has cores.

Instead of a plain map (which blocks until everything has been processed), I would suggest using imap_unordered. This is an iterator that starts returning results as soon as they become available, so your parent process can start further processing right away. If ordering is important, you might want to return a tuple of (number, array) to identify each result, as in the sketch below.
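
A minimal sketch of that approach (Python 3 shown; getvector is the question's own function, and the chunksize value is just an illustrative guess):

import multiprocessing as mp

# getvector is assumed to be the question's own feature-extraction function.

def worker(i):
    # Return the index together with the vector, so results can be put back
    # in order even though imap_unordered yields them as they finish.
    fileName = "Images/" + str(i) + '.jpg'
    return i, getvector(fileName).reshape((1, 2048))

if __name__ == '__main__':
    imageVectors = [None] * 100000
    # Pool() without arguments starts one worker per CPU core.
    with mp.Pool() as pool:
        for i, vec in pool.imap_unordered(worker, range(100000), chunksize=64):
            imageVectors[i] = vec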

Your function returns a numpy array of 2048 values, which I assume are numpy.float64. Using the standard mapping functions will transport the results back to the parent process using IPC. On a 4-core machine that means roughly 4 IPC transfers per second of 2048*8 = 16384 bytes each, so about 65536 bytes/second. That doesn't sound too bad. But I don't know how much overhead the IPC (which involves pickling and Queues) will incur.

In case the overhead is large, you might want to create a shared memory area to store the results in. You would need approximately 1.5 GiB to store 100000 results of 2048 8-byte floats. That is a sizeable amount of memory, but not impractical for current machines; see the sketch below.
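
A rough sketch of that idea, using a shared ctypes buffer that every worker writes into directly (again assuming the question's getvector; the buffer layout and chunksize are illustrative):

import ctypes
import multiprocessing as mp
import numpy as np

N, DIM = 100000, 2048

# One flat shared buffer for all results: 100000 * 2048 float64, about 1.5 GiB.
shared = mp.RawArray(ctypes.c_double, N * DIM)

def init(buf):
    # Expose the shared buffer inside each worker as a numpy view.
    global results
    results = np.frombuffer(buf, dtype=np.float64).reshape(N, DIM)

def worker(i):
    fileName = "Images/" + str(i) + '.jpg'
    results[i, :] = getvector(fileName).reshape(-1)

if __name__ == '__main__':
    with mp.Pool(initializer=init, initargs=(shared,)) as pool:
        pool.map(worker, range(N), chunksize=64)
    # A view for the parent process; no per-result IPC was needed.
    results = np.frombuffer(shared, dtype=np.float64).reshape(N, DIM)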

For 100000 images, 4 cores, and about one second per image, your program's running time would be on the order of 100000 s / 4 ≈ 7 hours.

Your most important optimization task would be to look into reducing the runtime of the getvector function. For example, would it work just as well if you halved the dimensions of the images? Assuming that the runtime scales linearly with the number of pixels, halving both dimensions quarters the pixel count and should cut the runtime to about 0.25 s; see the sketch below.
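
A tiny sketch of pre-shrinking the images with Pillow; whether getvector accepts the smaller files (and still produces useful vectors) is an assumption you would have to verify:

from PIL import Image

def shrink(i):
    # Halve both dimensions, i.e. quarter the pixel count.
    fileName = "Images/" + str(i) + '.jpg'
    img = Image.open(fileName)
    img = img.resize((img.width // 2, img.height // 2))
    img.save("Images/small_" + str(i) + '.jpg')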

Upvotes: 2
