CppLearner

Reputation: 17040

If the Python VM executes bytecode, where is the bytecode for non-imported modules?

I understand a few things based on the following link (I could be wrong!):

http://docs.python.org/2/glossary.html#term-bytecode

  1. .pyc is a cached file and is only generated if the module is imported somewhere else

  2. .pyc is to help loading performance, not execution performance.

  3. running python foo.py does not generate foo.pyc unless foo is imported somewhere.

  4. Python has a bytecode compiler (used to generate .pyc)

  5. Python's virtual machine executes bytecode.

So, when I run python foo.py, if foo.py is not imported anywhere, does Python still create bytecode in memory?

The missing .pyc seems to break the idea of a Python VM.

This question also extends to code execution in the interactive Python interpreter (running python in a terminal). I believe CPython (like just about any language implementation) can't do pure interpretation.

I think the core of the question is: does the VM actually read the .pyc file? I assume the VM loads the .pyc into the execution environment.

Upvotes: 1

Views: 645

Answers (3)

nneonneo

Reputation: 179462

Python is incapable of directly executing source code (unlike some other scripting languages which do ad hoc parsing, e.g. Bash). All Python source code must be compiled to bytecode, no matter what the source is. (This includes e.g. code run through eval and exec). Generating bytecode is rather expensive because it involves running a parser, so caching the bytecode (as .pyc) speeds up module loading by avoiding the parsing phase.
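
You can watch this happen with the built-in compile function, which hands back the in-memory code object without any .pyc file being involved (a minimal sketch; the snippet being compiled is arbitrary):

# Even a string handed to exec/eval is compiled to a code object first.
src = "print('hello world')"
code = compile(src, '<string>', 'exec')
print(type(code))   # <type 'code'>: the in-memory bytecode container
exec(code)          # the VM executes the code object directly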

The difference between import foo and python foo.py is simply that the latter doesn't cache the bytecode that is generated.

Upvotes: 2

Samy Vilar

Reputation: 11130

Interesting ... the first thing I did was look at --help

$ python --help
usage: python [option] ... [-c cmd | -m mod | file | -] [arg] ...
Options and arguments (and corresponding environment variables):
-B     : don't write .py[co] files on import; also PYTHONDONTWRITEBYTECODE=x
...

And the first option I see disables automatic .pyc and .pyo file generation on import, though that's probably just because the options are listed in alphabetical order.
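
As an aside, the same switch appears to be exposed from within Python as sys.dont_write_bytecode (a minimal sketch, assuming the flag is set before the import happens):

import sys

# Equivalent to passing -B: don't write .pyc/.pyo for subsequent imports.
sys.dont_write_bytecode = True

import test  # test.py still runs, but no test.pyc should be written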

Let's run some tests:

$ echo "print 'hello world'" > test.py
$ python test.py 
hello world
$ ls test.py*
test.py
$ python -c "import test"
hello world
$ ls test.py*
test.py     test.pyc

So it only generated the .pyc file when the module was imported.

Now, to check which files are actually being used, I'll use OS X's dtruss (similar to strace on Linux) to do a full trace:

$ echo '#!/bin/sh 
 python test.py' > test.sh 
$ chmod a+x test.sh
$ sudo dtruss -a ./test.sh 2>&1 | grep "test.py*"
975/0x5713:    244829       6      3 read(0x3, "#!/bin/sh \npython test.py\n\b\0", 0x50)         = 26 0
975/0x5713:    244874       4      2 read(0xFF, "#!/bin/sh \npython test.py\n\b\0", 0x1A)        = 26 0
977/0x5729:    658694       6      2 readlink("test.py\0", 0x7FFF5636E360, 0x400)        = -1 Err#22
977/0x5729:    658726      10      6 getattrlist("/Users/samyvilar/test.py\0", 0x7FFF7C0EE510, 0x7FFF5636C6E0 = 0 0
977/0x5729:    658732       3      1 stat64("test.py\0", 0x7FFF5636DCB8, 0x0)        = 0 0
977/0x5729:    658737       5      3 open_nocancel("test.py\0", 0x0, 0x1B6)      = 3 0
977/0x5729:    658760       4      2 stat64("test.py\0", 0x7FFF5636E930, 0x1)        = 0 0
977/0x5729:    658764       5      2 open_nocancel("test.py\0", 0x0, 0x1B6)      = 3 0

From the looks of it, Python did not even touch the test.pyc file!

$ echo '#!/bin/sh 
 python -c "import test"' > test.sh
$ chmod a+x test.sh
$ sudo dtruss -a ./test.sh 2>&1 | grep "test.py*"
$ sudo dtruss -a ./test.sh 2>&1 | grep "test.py*"
1028/0x5d74:    654642       8      5 open_nocancel("test.py\0", 0x0, 0x1B6)         = 3 0
1028/0x5d74:    654683       8      5 open_nocancel("test.pyc\0", 0x0, 0x1B6)        = 4 0
$

Well, that's interesting: it looks like it opened test.py and then test.pyc.

What happens when we delete the .pyc file?

$ rm test.pyc
$ sudo dtruss -a ./test.sh 2>&1 | grep "test.py*"
1058/0x5fd6:    654151       7      4 open_nocancel("/Users/samyvilar/test.py\0", 0x0, 0x1B6)        = 3 0
1058/0x5fd6:    654191       6      3 open_nocancel("/Users/samyvilar/test.pyc\0", 0x0, 0x1B6)       = -1 Err#2
1058/0x5fd6:    654234       7      3 unlink("/Users/samyvilar/test.pyc\0", 0x1012B456F, 0x1012B45E0)        = -1 Err#2
1058/0x5fd6:    654400     171    163 open("/Users/samyvilar/test.pyc\0", 0xE01, 0x81A4)         = 4 0

It first opened test.py, then 'tried' to open test.pyc, which returned an error, then called unlink and generated the .pyc file all over again ... interesting, I thought it would check first.

What if we delete the original .py file?

$ sudo dtruss -a ./test.sh 2>&1 | grep "test.py*"
1107/0x670d:    655064       4      1 open_nocancel("test.py\0", 0x0, 0x1B6)         = -1 Err#2
1107/0x670d:    655069       8      4 open_nocancel("test.pyc\0", 0x0, 0x1B6)        = 3 0

No surprise there: it couldn't open test.py, but it still continued using test.pyc. To this day I'm not sure this is actually 'ok'; Python should give out some kind of warning. I've been burned a couple of times by this: accidentally deleting my files, running my tests, and breathing a sigh of relief as they pass, only to start sweating when I can't find the source code!

After these tests we can assume Python only uses .pyc files either directly, when invoked as python test.pyc, or indirectly, when a module is imported; otherwise it doesn't seem to touch them.
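
(As another aside, if you ever do want a .pyc for a script without importing it, the standard library's py_compile module should generate one explicitly; a minimal sketch:)

import py_compile

# Explicitly compile test.py to test.pyc without importing it.
py_compile.compile('test.py')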

Supposedly CPython's compiler was designed to be fairly fast: it doesn't do much type checking, and it generates very high-level bytecode, so most of the workload is actually done by the virtual machine. It probably does a single pass, lexing -> compiling -> bytecode, in one go, every time it reads a Python file from the command line; when importing, if no .pyc file is present, it creates one.
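
You can get a feel for how high-level the bytecode is with the dis module (a small illustration; the function is just an example):

import dis

def add(a, b):
    return a + b

# On CPython 2.x this shows LOAD_FAST, LOAD_FAST, BINARY_ADD,
# RETURN_VALUE: four instructions, but BINARY_ADD alone hides dynamic
# dispatch, type checks and possible operator overloading, so the VM
# does most of the real work.
dis.dis(add)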

This may be why some other implementations are faster: they take more time to compile, but generate much lower-level bytecode that can be better optimized.

It's extremely difficult to build a virtual machine that does pure interpretation efficiently ...

It's all about balance: the more powerful your bytecode, the simpler your compiler can be, but the more complex and slower your virtual machine has to be, and vice versa ...

Upvotes: 1

Armin Rigo

Reputation: 12900

Your points 1 to 5 are correct, with the exception (if we're precise) of point 4. The Python interpreter has a part called the bytecode compiler that turns source code into a <code object at 0x...>, which you can inspect by typing f.__code__ for any function f. This is the real bytecode that is interpreted. These code objects may then, as a separate step, be saved inside .pyc files.
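
For instance, a quick way to look at one (the function here is arbitrary):

def f(x):
    return x + 1

# The bytecode compiler already ran when the 'def' statement executed.
print(f.__code__)                # <code object f at 0x...>
print(repr(f.__code__.co_code))  # the raw bytecode the VM interprets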

Here are the operations in more detail. The bytecode compiler runs only once per module, when you load foo.py and each of the modules it imports. It's not a very long operation, but it still takes some time, particularly if your module imports a lot of other modules. This is where .pyc files enter the picture. After an import statement has invoked the bytecode compiler, it tries to save the resulting <code object> inside a .pyc file. The next time, if the .pyc file already exists and the .py file has not been modified, the <code object> is reloaded from there. This is just an optimization: it avoids the cost of invoking the bytecode compiler. In both cases the result is the same: a <code object> was created in memory and is going to be interpreted.
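
To make "saved inside a .pyc file" concrete, here is a rough sketch of what a CPython 2.x .pyc contains; the exact header layout varies between Python versions, so treat this as illustrative rather than definitive:

import marshal

# Rough layout of a CPython 2.x .pyc: a 4-byte magic number, a 4-byte
# timestamp of the source file, then the marshalled <code object>.
with open('foo.pyc', 'rb') as f:
    magic = f.read(4)       # must match the running interpreter's magic
    mtime = f.read(4)       # compared against foo.py's modification time
    code = marshal.load(f)  # the same <code object> import would rebuild

exec(code)  # the VM interprets the reloaded code object, just as on import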

This only works for import statements, not, for example, for the main module (i.e. the foo.py in the command line python foo.py). The idea is that it should not really matter: in a typical medium-to-large program, the bytecode compiler loses its time compiling all the directly and indirectly imported modules, not just compiling foo.py.

Upvotes: 5
