Dimitris

Reputation: 43

Python - multiprocessing

I want to complete the following task:

I have an "input" TSV file:

0   2   0
2   5   1
5   10  2
10  14  5

And I want to convert it to the following format:

0
0
1
1
1
2
2
2
2
2
5
5
5
5

I managed to do this with the following code (start is the first column of the input file, stop is the second, and depth is the third):

from tqdm import tqdm

def parse(i):
    out = []
    start = int(i[0])
    stop = int(i[1])
    depth = i[2]
    times = stop - start
    out += times * [depth]
    return out

signal = []
for i in tqdm(file):  # file: rows of the TSV, already split into columns
    x = parse(i)
    signal.append(x)

with open('output.txt', 'w') as f:
    for item in signal[0]:
        f.write("%s\n" % item)

However, my input file has 16720973 lines and I have many such files, so I tried to run parallel processes to minimize execution time with the following code:

import multiprocessing as multip

def parse(start, stop, depth):
    out = []
    times = int(stop) - int(start)
    out += times * [depth]
    return out

signal = []
poolv = multip.Pool(20)
x = [poolv.apply(parse, args=(i[0], i[1], i[2])) for i in tqdm(file)]
signal.append(x)
poolv.close()

But there was no difference in execution time, and I don't think any multiprocessing actually took place. Is there a mistake, or a better way to solve this problem in order to minimize execution time?

Upvotes: 1

Views: 107

Answers (1)

constt

Reputation: 2320

The docs for the apply(func[, args[, kwds]]) function say that

It blocks until the result is ready. Given this blocks, apply_async() is better suited for performing work in parallel. Additionally, func is only executed in one of the workers of the pool.

This means that you are processing the lines of the input file sequentially, blocking the pool until a result is produced by one of the pool workers. The second thing is that I don't think you'll get a noticeable speed-up by splitting the processing of different lines of the input file between pool workers. I'll go further: I think you'll actually slow the whole thing down a bit, because you'll spend more time transferring data back and forth between processes than you save on the processing itself, since in your case it isn't a long-running job.
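
Just to illustrate the difference the docs are talking about, here is a minimal sketch of what non-blocking submission with apply_async could look like, assuming the rows have already been split into (start, stop, depth) tuples; the sample rows and the pool size below are placeholders:

import multiprocessing as multip

def parse(start, stop, depth):
    # repeat the depth value (stop - start) times
    return (int(stop) - int(start)) * [depth]

if __name__ == '__main__':
    # placeholder rows; in practice these would come from the TSV file
    rows = [('0', '2', '0'), ('2', '5', '1'), ('5', '10', '2'), ('10', '14', '5')]

    with multip.Pool(4) as pool:
        # apply_async returns immediately, so all rows are submitted up front
        pending = [pool.apply_async(parse, args=row) for row in rows]
        # collect the results once all the work has been handed out
        signal = [res.get() for res in pending]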

It might perhaps be worth trying to parallelize the processing of multiple input files, but taking into account that they are usually stored on the same HDD, that won't give you any speed-up either.
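
If you still want to try the per-file approach, here is a rough sketch of one way it could look, assuming each input is a tab-separated file and the result is written next to it as <name>.out; the file names are hypothetical:

import csv
import multiprocessing as multip

def expand_file(path):
    # turn every (start, stop, depth) row of one TSV file into
    # (stop - start) repeated depth lines in <path>.out
    with open(path) as src, open(path + '.out', 'w') as dst:
        for start, stop, depth in csv.reader(src, delimiter='\t'):
            dst.write((int(stop) - int(start)) * (depth + '\n'))

if __name__ == '__main__':
    paths = ['a.tsv', 'b.tsv', 'c.tsv']  # hypothetical input files
    with multip.Pool(len(paths)) as pool:
        pool.map(expand_file, paths)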

BTW, if you find it useful, here is how to do your processing in one line using bash and awk:

while read line; do echo $line | awk '{for(i = 0; i < $2 - $1; i++) print $3}'; done < input.txt > output.txt

This is your input.txt:

0   2   0
2   5   1
5   10  2
10  14  5

And this is what you get in the output.txt file:

0
0
1
1
1
2
2
2
2
2
5
5
5
5

Using this approach you can start a bunch of jobs in a terminal and see if it will speed up processing of multiple files.

Upvotes: 1
