Reputation: 1713
I have a very large list of strings (originally from a text file) that I need to process using Python. Eventually I am trying to go for a map-reduce style of parallel processing.
I have written a "mapper" function and fed it to multiprocessing.Pool.map(), but it takes the same amount of time as simply calling the mapper function with the full set of data. I must be doing something wrong.
I have tried multiple approaches, all with similar results.
from multiprocessing import Pool

def initial_map(lines):
    results = []
    for line in lines:
        processed = # process line (an O(1) operation)
        results.append(processed)
    return results

def chunks(l, n):
    for i in xrange(0, len(l), n):
        yield l[i:i+n]

if __name__ == "__main__":
    lines = list(open("../../log.txt", 'r'))
    pool = Pool(processes=8)
    partitions = chunks(lines, len(lines)/8)
    results = pool.map(initial_map, partitions, 1)
So the chunks function generates sublists of the original list of lines to give to pool.map(), which should then hand these 8 sublists to 8 different processes and run them through the mapper function. When I run this I can see all 8 of my cores peak at 100%. Yet it takes 22-24 seconds.
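For example, on a toy list (rather than the real log lines) the chunks generator yields sublists like this:

>>> list(chunks(range(10), 4))
[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]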
When I simply run this (single process/thread):
lines = list(open("../../log.txt", 'r'))
results = initial_map(lines)
It takes about the same amount of time. ~24 seconds. I only see one process getting to 100% CPU.
I have also tried letting the pool split up the lines itself and having the mapper function handle only one line at a time, with similar results.
from multiprocessing import Pool

def initial_map(line):
    processed = # process line (an O(1) operation)
    return processed

if __name__ == "__main__":
    lines = list(open("../../log.txt", 'r'))
    pool = Pool(processes=8)
    pool.map(initial_map, lines)
~22 seconds.
Why is this happening? Parallelizing this should make it run faster, shouldn't it?
Upvotes: 4
Views: 3075
Reputation: 38217
If the amount of work done in one iteration is very small, you're spending a big proportion of the time just communicating with your subprocesses, which is expensive. Instead, try to pass bigger slices of your data to the processing function. Something like the following:
slices = (data[i:i+100] for i in range(0, len(data), 100))

def process_slice(data):
    return [initial_map(x) for x in data]

pool.map(process_slice, slices)
# and then itertools.chain the output to flatten it
(don't have my comp. so can't give you a full working solution nor verify what I said)
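Untested, but here is a fuller sketch of what I mean; the per-line processing is just a stand-in for yours, and the slice size of 100 is arbitrary:

from multiprocessing import Pool
from itertools import chain

def initial_map(line):
    return line.strip()  # stand-in for your real O(1) per-line processing

def process_slice(slice_of_lines):
    # each task carries a whole slice, so the IPC cost is paid
    # once per slice instead of once per line
    return [initial_map(line) for line in slice_of_lines]

if __name__ == "__main__":
    lines = list(open("../../log.txt", 'r'))
    slices = [lines[i:i+100] for i in range(0, len(lines), 100)]

    pool = Pool(processes=8)
    slice_results = pool.map(process_slice, slices)

    # flatten the per-slice result lists back into one flat list
    results = list(chain.from_iterable(slice_results))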
Edit: or see the 3rd comment on your question by @ubomb.
Upvotes: 1