Eumel

Reputation: 1433

Parallelizing for loop in Python

I coded a neural network that runs really slow, so I was hoping to speed it up a little by parallelizing a certain loop.

I am not sure about the implementation, how the GIL works, and whether it is relevant for me.

The code looks like this:

class net():
    def train(self, inputs):
        ...
        for tset in batch:
            self.backprop(tset, learn_rate)

    def backprop(self, inputs):
        ...
        for c, w in zip(n.inconnections, range(len(n.inconnections))):
            c.current_error += n.current_error * n.inconnectionweights[w]
            n.icweightupdates[w] += -learn_rate * n.current_error * c.activation
        n.biasupdate += -learn_rate * n.current_error

The loop in train() is the one I want to parallelize since batch contains a set of training samples (20) that could be processed independently.

Upvotes: 2

Views: 383

Answers (2)

user3666197

Reputation: 1

While "AGILE" evangelisation gang beats their drums louder and louder,

Never start coding before you know it makes sense to do so:

Why?

You can indeed spend an infinite amount of time coding a thing that made no sense to start with, whereas a reasonable amount of time spent on due systems engineering will help you determine which steps are reasonable and which are not.

So, let's start from a common ground: what is your motivation for a parallelisation -- yes, PERFORMANCE.

If we both agree on this, let's review this new domain of expertise.

It is not hard to sketch code that is awfully bad at this.

It is increasingly important to be able both to understand and to design code that can harness the contemporary hardware infrastructures to their maximum possible performance.

Parallelisation in general is intended for many other reasons, not just performance -- your case, as proposed above, is at the end actually a "just"-[CONCURRENT] type of process-scheduling. Not a true-[PARALLEL] one? Sure -- unless it ran on a machine having more than 21 CPU cores, with HyperThreading off and all operating-system processes stopped for the time being, until all 20 examples got processed there and back, through all the global-minimiser loops, until convergence.

And imagine: you are trying to run just 20 Machine Learning inputs ( examples ), while real-world DataSETs have many thousands ( well beyond many hundreds of thousands in my problem domain ) of examples to process, so you will never reach such an extreme with a true-[PARALLEL] process-scheduling this way.

Best start by understanding Amdahl's Law in its full context first ( an overhead-naive formulation does a bad service for freshmen experimentation -- better master in full detail also the Criticism section of the updated post, on both the overheads and the resources-bound limits, before voting for going into "parallelisation at any cost" -- even if half a dozen "wannabe gurus" advise you to do so and so many PR-motivated media today shout at you to go parallel ).
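
As a back-of-the-envelope illustration of that ceiling -- a minimal sketch, with the overhead figures made up purely for illustration:

# an overhead-strict re-formulation of Amdahl's Law:
# the speedup achievable on N processors, where p is the parallelisable
# fraction of the work and oh is the add-on overhead of spawning and
# communication, expressed as a fraction of the original serial runtime
def amdahl_speedup( p, N, oh = 0.0 ):
    return 1.0 / ( ( 1.0 - p ) + oh + p / N )

print( amdahl_speedup( p = 0.95, N = 20, oh = 0.00 ) )  # ~10.3x, overhead-naive
print( amdahl_speedup( p = 0.95, N = 20, oh = 0.05 ) )  # ~ 6.8x, overhead-strict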

Next, read this about the details and differences that may come just from using better or more appropriate tools ( once the principal speedup ceilings of Amdahl's Law are understood, a speedup of +512x will shock you and set the gut feeling for what makes sense and what does not ). This is relevant for every performance-bottleneck review and re-engineering. Most Neural Networks spend immense amounts of time ( not due to poor code performance, but because of immense DataSET sizes ) re-running the Feed-Forward + Back-Propagation phases, where vectorised code is harder to design than plain python code, but this is where performance can be gained, not lost.
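
For a flavour of what such vectorisation could look like here -- a sketch only, assuming the per-neuron lists above were re-stored as numpy arrays ( the names W, a_in, err_out are hypothetical, not from the original code ):

import numpy as np

# hypothetical array-based layer state ( names are illustrative ):
# W       : ( n_out, n_in ) weight matrix of one layer
# a_in    : ( n_in, )       activations entering the layer
# err_out : ( n_out, )      per-neuron errors of the layer

def backprop_vectorised( W, a_in, err_out, learn_rate ):
    err_in = W.T.dot( err_out )                       # all error back-propagations at once
    dW = -learn_rate * np.outer( err_out, a_in )      # all weight updates in one outer product
    return err_in, dW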

Python can rather use the smart vectorisation tools from numpy, plus you may design your code to harness minimum overheads by the systematic use of views, instead of repetitively losing performance on memory allocations when passing around copies of dense matrices in Neural Networks, if python implementations are to show their possible speeds.
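
A small sketch of the view-versus-copy distinction ( array names are again illustrative ):

import numpy as np

W = np.random.rand( 1000, 1000 )

row_view = W[0, :]           # a view: no data copied, no new allocation
row_copy = W[0, :].copy()    # an explicit copy: allocates fresh memory

# in-place updates via the out= parameter avoid allocating a temporary
# array on every training step:
dW = np.zeros_like( W )
np.multiply( W, 0.99, out = W )   # decay the weights without a temporary
np.add( W, dW, out = W )          # apply the updates in place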

There are so many cases when just-coding will generate awfully bad processing performance, whatever craftsmanship the code may pour in, whereas a bit of mathematics will show a smart re-engineering of the computing process, and the process performance suddenly jumps a few orders of magnitude ( yes, 10x, 100x faster just by using the human brain and critical thinking ).

Yes, it is hard to design fast code, but nobody promised you a free dinner, did they?


Last, but not least:

never let all [CONCURRENT] tasks do exactly the same job, much less re-repeat it over and over:

for c, w in zip( n.inconnections, range( len( n.inconnections ) ) ):
    ...

This syntax is easy to code, but it introduces a re-calculated zip() processing into each and every "wanted-to-get-accelerated" task. No. Indeed a bad idea, if performance is still in mind.
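
For what it is worth, a cheap fix in plain python is to let enumerate() produce the index directly, instead of re-building a range() and a zip() on every backprop() call -- a sketch:

# enumerate() yields ( index, element ) pairs directly, avoiding the
# re-created range()/zip() machinery inside every backprop() call:
for w, c in enumerate( n.inconnections ):
    c.current_error += n.current_error * n.inconnectionweights[w]
    n.icweightupdates[w] += -learn_rate * n.current_error * c.activation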

Upvotes: 1

Valentin Lorentz

Reputation: 9753

Python threads will not make this faster, because of the GIL. Instead, you could try processes:

import multiprocessing
from functools import partial

class net():
    def train(self, inputs):
        ...
        # a lambda cannot be pickled for multiprocessing, so bind the extra
        # argument with functools.partial instead; note that each worker
        # process operates on its own copy of self, so only the returned
        # values make it back to the parent process
        with multiprocessing.Pool() as p:
            biasupdates = p.map(partial(self.backprop, learn_rate=learn_rate), batch)
        n.biasupdate += sum(biasupdates)

    def backprop(self, inputs, learn_rate):
        ...
        for c, w in zip(n.inconnections, range(len(n.inconnections))):
            c.current_error += n.current_error * n.inconnectionweights[w]
            n.icweightupdates[w] += -learn_rate * n.current_error * c.activation
        return -learn_rate * n.current_error

Upvotes: 0
