Reputation: 1171
I am recently learning some HPC topics and have learned that modern C/C++ compilers are able to detect places where optimization is applicable and perform it using techniques such as SIMD and loop unrolling, especially under the -O3 flag, with a tradeoff between runtime performance on one side and compile time plus object file size on the other.
Then it immediately occurred to me that CPython interprets and executes on the fly, so I assume it cannot afford those compiler features, because for it compile time is effectively runtime. So I did a toy experiment below:
import time, random

n = 512
A = [[random.random() for _ in range(n)] for _ in range(n)]
B = [[random.random() for _ in range(n)] for _ in range(n)]
C = [[0] * n for _ in range(n)]

def matMul( A, B, C ):
    """ C = A .* B """
    for i in range(0, n - 4, 4):
        for j in range(0, n - 4, 4):
            for k in range(0, n - 4, 4):
                C[i][j] = A[i][k] * B[k][j]
                C[i + 1][j + 1] = A[i + 1][k + 1] * B[k + 1][j + 1]
                C[i + 2][j + 2] = A[i + 2][k + 2] * B[k + 2][j + 2]
                C[i + 3][j + 3] = A[i + 3][k + 3] * B[k + 3][j + 3]
                C[i + 4][j + 4] = A[i + 4][k + 4] * B[k + 4][j + 4]
    # return C

start = time.time()
matMul( A, B, C )
end = time.time()
print( f"Elapsed {end - start}" )
With the loop unrolling, the program finishes within 3 seconds; without it, it takes almost 20 seconds.
Does that mean one needs to pay attention and manually implement those optimization techniques when writing Python code? Or does Python offer such optimization under some special setting?
Upvotes: 2
Views: 135
Reputation: 13491
(See the comments for why your code is not working, since that is not the real question.)
As Josh already said, CPython does no such optimization. So loop unrolling may be useful (under the assumption that simply not using loops in Python at all, for example by letting numpy do it for you, is not an option).
But I would like to point out that you are wrong about interpreters doing things "on the fly" and therefore not being able to optimize code.
Interpreters never just interpret the text; not even the oldest ones, and even less the modern ones. They work a bit like compilers, with different stages: tokenization; then syntactic analysis, which transforms the text into a tree of tokens; then semantic analysis, which, for example, links identifiers to their definitions (which is why, even in an interpreter, and contrary to what we sometimes read here, it really doesn't matter, performance-wise, whether your variable names have 1 character or 1000: the names have disappeared from the code before execution even starts), and does some type checking or evaluation, if the language is typed, etc.
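As a concrete illustration, CPython itself exposes the tree that these stages produce; this minimal sketch just parses a small loop and prints the resulting AST (the loop and names in the string are arbitrary):

import ast

# The tree CPython builds from the source text before anything runs.
tree = ast.parse("for i in range(4):\n    total = total + i")
print(ast.dump(tree, indent=4))  # indent= requires Python 3.9+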
At that stage, it would be possible for an interpreter to unroll loops: it could see that the loop counter is not used outside the loop (so there is no need to keep it), and decide to transform the tree, replacing a for by a repetition of the instructions.
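To make that idea tangible, here is a minimal sketch of such a tree transformation, written with CPython's own ast module; it deliberately handles only the narrow pattern for <name> in range(<integer literal>):

import ast, copy

class UnrollConstantRange(ast.NodeTransformer):
    """Unroll `for <name> in range(<int literal>)` by pasting the body
    once per iteration, with the counter bound to a constant."""
    def visit_For(self, node):
        self.generic_visit(node)
        if (isinstance(node.target, ast.Name)
                and isinstance(node.iter, ast.Call)
                and isinstance(node.iter.func, ast.Name)
                and node.iter.func.id == "range"
                and len(node.iter.args) == 1
                and isinstance(node.iter.args[0], ast.Constant)):
            unrolled = []
            for value in range(node.iter.args[0].value):
                # Emit `i = <value>`, then one copy of the loop body.
                unrolled.append(ast.Assign(
                    targets=[ast.Name(id=node.target.id, ctx=ast.Store())],
                    value=ast.Constant(value=value)))
                unrolled.extend(copy.deepcopy(node.body))
            return unrolled
        return node

tree = ast.parse("for i in range(3):\n    print(i)")
tree = ast.fix_missing_locations(UnrollConstantRange().visit(tree))
exec(compile(tree, "<unrolled>", "exec"))  # prints 0, 1, 2 -- no loop left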
And even the one stage that differentiates a compiler from an interpreter, code generation, is nowadays often run by interpreters too, since most modern interpreters first compile for a virtual machine and then run the virtual-machine code. So at this stage too it is possible to perform some optimizations, such as factoring out intermediate results and the like.
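CPython does exactly that: it compiles to bytecode for its virtual machine, which you can inspect with the standard dis module:

import dis

def f():
    for i in range(4):
        print(i)

dis.dis(f)  # the bytecode CPython's virtual machine actually executes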
So, that doesn't change the answer: no, CPython doesn't really optimize, and yes, with CPython it is not surprising that unrolling loops sometimes gains time (it of course depends on what is in the loop: if the cost of incrementing a counter and comparing it to a bound is negligible compared to what is inside the loop, then unrolling is rather pointless; a quick measurement below illustrates the cheap-body case). But the reason is not "because Python is an interpreter, and interpreters, unlike compilers, cannot do optimization". Modern interpreters are basically compilers plus virtual machines. And even the more traditional ones interpret a tree representing the program, not the text of the code, and some optimizations, like unrolling loops, could be done directly in the phase that builds that tree (that is, as soon as the .py file is read, even if not executed). It is just that CPython chooses not to.
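For instance, a minimal timeit comparison of a trivial loop body against its hand-unrolled form (absolute numbers will vary by machine):

import timeit

looped   = "s = 0\nfor i in range(4):\n    s += 1"
unrolled = "s = 0\ns += 1\ns += 1\ns += 1\ns += 1"

# The loop bookkeeping dominates a body this cheap, so unrolling wins;
# with a heavier body the gap shrinks toward nothing.
print(timeit.timeit(looped))
print(timeit.timeit(unrolled))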
(And CPython chooses not to, because Python's point is not to be optimal. If you need fast code, you shouldn't be coding it in pure Python. You should be relying on numpy or the like, which are not coded in Python, or using your own extensions, with the Python/C API, or ctypes, or numba, or cython, or...)
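As one sketch of that route, assuming the third-party numba package is installed: its @njit decorator compiles the function to machine code on first call, and its compiler applies exactly the kind of loop optimizations discussed above.

import numpy as np
from numba import njit

@njit
def matmul(A, B, C):
    n = A.shape[0]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i, k] * B[k, j]
            C[i, j] = s

A = np.random.rand(512, 512)
B = np.random.rand(512, 512)
C = np.zeros((512, 512))
matmul(A, B, C)  # first call compiles; later calls run at native speed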
Upvotes: 3
Reputation: 58392
Loop unrolling is useful because it can (1) reduce the overhead spent managing the loop and (2) at the assembly level, let the processor run faster by eliminating branch penalties, keeping the instruction pipeline full, etc.
(2) doesn't really apply to an interpreted language implementation like CPython - it's already doing lots of branching and decision-making at the assembly level. It might gain you something with (1), but my gut feeling is that that time is often dwarfed by interpreter overhead. The golden rule of performance optimization is to first measure and confirm that the bottleneck is where you think it is.
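For example, the standard library's cProfile gives a quick first measurement (a sketch, assuming the matMul, A, B, and C from your script are in scope):

import cProfile

cProfile.run("matMul(A, B, C)")  # per-function timings, before optimizing anything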
Incidentally, your code has a bug:
C[i][j] = A[i][k] * B[k][j]
C[i + 1][j + 1] = A[i + 1][k + 1] * B[k + 1][j + 1]
C[i + 2][j + 2] = A[i + 2][k + 2] * B[k + 2][j + 2]
C[i + 3][j + 3] = A[i + 3][k + 3] * B[k + 3][j + 3]
C[i + 4][j + 4] = A[i + 4][k + 4] * B[k + 4][j + 4]
It processes cells (0, 0), (1, 1), (2, 2), (3, 3), and (4, 4) (even though (4, 4) will also be processed on the next iteration), but not (0, 1), (0, 2), (1, 0)... (That's the other reason for the golden rule of performance optimization: it's easy to make mistakes by optimizing code that doesn't need it.)
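For reference, a corrected version might look like the sketch below: assuming the intent is a true matrix product, it unrolls only the innermost k loop, accumulates the products instead of overwriting, visits every (i, j) cell, and (like your code) reads n from the enclosing scope, which must be a multiple of 4.

def matMul(A, B, C):
    """C = A times B, with the k loop unrolled by a factor of 4."""
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(0, n, 4):
                s += A[i][k] * B[k][j]
                s += A[i][k + 1] * B[k + 1][j]
                s += A[i][k + 2] * B[k + 2][j]
                s += A[i][k + 3] * B[k + 3][j]
            C[i][j] = s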
As @Mat said, the standard approach for Python in particular is to use NumPy, which uses an optimized C implementation.
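A minimal sketch of that approach for arrays the size of the ones in the question (numpy assumed installed):

import numpy as np

A = np.random.rand(512, 512)
B = np.random.rand(512, 512)
C = A @ B  # dispatches to an optimized (typically BLAS-backed) routine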
All the above applies to CPython, the standard Python implementation. There are other Python implementations like Cython that offer their own optimizations; I'm less familiar with those.
Upvotes: 4