Reputation: 35521
I need to load ~100k files containing vectors and aggregate their contents into a single numpy array. This takes ~3 minutes, so I want to speed it up. I tried gevent, but it gave no speedup.
I read that async calls, rather than multiprocessing, are the way to speed up IO, and that gevent is the recommended library. I wrote an example that downloads images, and there I saw a huge improvement in speed. Here is a simplified version of my code:
    import gevent.pool
    import numpy

    def chunks(l, n):
        """Yield successive n-sized chunks from l."""
        for i in range(0, len(l), n):
            yield l[i:i+n]

    file_paths = [...]  # list of filenames
    # poolsize, CHUNK_SIZE and file_size are defined elsewhere (simplified away here)
    numpy_array = numpy.ones([len(file_paths), file_size])
    pool = gevent.pool.Pool(poolsize)
    for i, list_file_path_tuples in enumerate(chunks(file_paths, CHUNK_SIZE)):
        # Load one chunk of files through the gevent pool
        gevent_results = pool.map(numpy.load, list_file_path_tuples)
        pool.join()
        # Copy each loaded vector into its row of the output array
        for i_chunk, result in enumerate(gevent_results):
            index = i * CHUNK_SIZE + i_chunk
            data = result['arr_0']
            numpy_array[index] = data
Chunking is necessary because otherwise all the vectors would be in memory twice.
Is there an issue in my code, or am I using the wrong approach?
Upvotes: 3
Views: 778
Reputation: 8831
Have you profiled your code to find where the hotspot is? If the time is not spent computing, it is probably just disk IO. I doubt you will get a performance boost from tricks in the IO logic; in the end, sequential disk access is likely the limit. If you have a RAID system, it might make sense to have multiple threads reading from the disk, but you could do that with Python's standard threads. Try ramping up from 1 thread to a few and measure along the way to find the sweet spot.
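A rough sketch of that ramp-up, assuming Python 3 and the standard library's `concurrent.futures` thread pool. The names `read_bytes`, `time_load` and `sweep` are made up for illustration, and `read_bytes` is only a stand-in that reads the raw file; you would pass your own loader (for example a small function wrapping `numpy.load`) as `load`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_bytes(path):
    # Stand-in loader (hypothetical): reads the raw file contents.
    # Replace it with whatever actually parses your files.
    with open(path, "rb") as f:
        return f.read()


def time_load(paths, n_workers, load=read_bytes):
    """Load every path using n_workers threads; return (seconds, results)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_workers) as executor:
        # executor.map returns results in the same order as paths
        results = list(executor.map(load, paths))
    return time.perf_counter() - start, results


def sweep(paths, worker_counts=(1, 2, 4, 8), load=read_bytes):
    """Time the same load for several thread counts."""
    timings = {}
    for n in worker_counts:
        elapsed, _ = time_load(paths, n, load)
        timings[n] = elapsed
    return timings
```

Calling `sweep(file_paths[:2000])` on a sample of your files would give a dict mapping thread count to seconds. Keep in mind that the OS caches files after the first read, so later passes over the same files can look faster than they would on a cold disk; measuring each thread count on a different sample of files is one way around that.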
The reason you saw an improvement when downloading images in parallel with gevent is that network IO throughput can be improved a lot with multiple connections. A single network connection can hardly saturate the network bandwidth when the remote server is not directly attached to your network device, whereas a single disk IO operation can easily saturate the disk throughput.
Upvotes: 4