Amit Tripathi
Amit Tripathi

Reputation: 7261

Can compile function output be used to detect changes in the source code

Python's builtin function compile()

code = """def fact(x):
    if x <= 1:
        return 1
    print (1)
    return x*fact(x-1)
fact(10)"""
c = compile(code, "<string>", "exec")

code = """def fact(y):
    if y <= 1:
        return 1
    print (1)
    return y*fact(y-1)
fact(10)"""
d = compile(code, "<string>", "exec")

here c == d is False. Which is the expected behavior? (adding space in source code or making print 1 instead of print(1) doesn't result in a changed object, which is correct.)

My question is: Can compile object be used to detect changes in the Python source code?

Edit:

To explain it more clearly, I am working on a web application that will allow users to execute their Python code.

Every time a user executes code, its executed on the server. Even adding a space/bracket etc results in a new execution. I am trying to optimize this step by storing the compiled code and if the code is the same in the new request then not executing it.

How do I know that code has changed? (Is it the kind of change that requires executing the code again or a simple space addition.)

I do think there are other ways to achieve what I am trying to achieve by using hashed value or something like this but I want to do it this way because it seems more Pythonic and better than reinventing the wheel.

Upvotes: 2

Views: 334

Answers (3)

Dimitris Fasarakis Hilliard
Dimitris Fasarakis Hilliard

Reputation: 160617

Don't do this, as an optimization it won't result in great speed ups.

compile is a built-in function, it is implemented in C, it is fast and is not a place where you should be looking to optimize. When a user tries to execute code you should allow it to be compiled and executed without any sort of caching.

Consider the following, writing code that tries to discover if there's any difference in text entered by a user, will most likely result in more execution time than actually just compiling the function and executing it. Apart from this, you also have the fact that you'll have to store and retrieve the compile code somewhere. The first adds unnecessary space requirements and, the second, adds to the overhead because of the look ups you have to do (depending on how you chose to store it, of course).


Apart from what I just said, you can try and compare the result of compiling Python code but the approach is limited and obfuscated. I will assume the code snippets you entered should compare equal since the only difference between them lies in the parameter names (whose name has no impact on code execution).

Use the byte code:

In general the approach I'm going to present only returns "equal" (True) if the code snippets in question only differ in white-space and/or parameter names, in all other cases, it returns False (you could attempt to make it more intelligent but that would probably require you taking many edge cases into consideration).

What you should generally be aware of is that compile returns a code object. code objects generally contain all the information Python needs in order to execute a piece of code. You can view the instructions contained in a code object by using the dis module:

from dis import dis
dis(c)
  1           0 LOAD_CONST               0 (<code object fact at 0x7fa7bc30e630, file "<string>", line 1>)
              3 MAKE_FUNCTION            0
              6 STORE_NAME               0 (fact)

  6           9 LOAD_NAME                0 (fact)
             12 LOAD_CONST               1 (10)
             15 CALL_FUNCTION            1
             18 POP_TOP             
             19 LOAD_CONST               2 (None)
             22 RETURN_VALUE  

This is what your code snippet does. It loads another code object (which represents the function fact), makes a function call and returns.

The same exact output is generated when you call dis(d), the only difference lies in the loading of the code object for fact:

dis(d)
  1           0 LOAD_CONST               0 (<code object fact at 0x7fa7bc30e5b0, file "<string>", line 1>)

As you can see, they are different:

code object fact at 0x7fa7bc30e630 != code object fact at 0x7fa7bc30e5b0

but their difference does not lie in their functionality, rather they are different instances representing a unit of functionality that is identical.

So here lies the question: do we care about the fact that they are different instances if they represent the same functionality? Similarly, do we care if two variables have different names if they represent the same value? I'm assuming we don't and this is why I think you can make a meaningful basic comparison of code objects if you compare their raw byte code.

Comparing the raw byte code:

Raw byte code is pretty much what I previously described, no names, no identities just a sequence of commands for the Python interpreter (pure functionality). Let's take a look at it:

c.co_code
'd\x00\x00\x84\x00\x00Z\x00\x00e\x00\x00d\x01\x00\x83\x01\x00\x01d\x02\x00S'

Okay, that's ugly, let's have a better look:

dis(c.co_code)
  0 LOAD_CONST          0 (0)
  3 MAKE_FUNCTION       0
  6 STORE_NAME          0 (0)
  9 LOAD_NAME           0 (0)
 12 LOAD_CONST          1 (1)
 15 CALL_FUNCTION       1
 18 POP_TOP        
 19 LOAD_CONST          2 (2)
 22 RETURN_VALUE 

That looks better. This is exactly the same as the previous dis(c) command, the only difference is that the names are not present since they don't really play a part in all this.

So, what do we get if we compare d.co_code with c.co_code? Of course, equality since the commands executed are identical. But, there's a pit-fall here, in order to be 100% sure that d.co_code is equal to c.co_code we also need to compare the code objects for the functions loaded inside c and d (the code objects representing the function fact) if we don't compare these, we'll be getting false positives.

So where does the code object for the functions fact lie in each case? They lie in a field called co_consts inside the code object c and d respectively, co_consts is a list containing all the constants for a specific code object. If you take a peak inside there, you'll be able to see the definition for each of these:

# located at position 0 in this case.
# the byte code for function 'fact'
dis(c.co_consts[0])  
  2           0 LOAD_FAST                0 (x)
              3 LOAD_CONST               1 (1)
              6 COMPARE_OP               1 (<=)
              9 POP_JUMP_IF_FALSE       16

  3          12 LOAD_CONST               1 (1)
             15 RETURN_VALUE        

  4     >>   16 LOAD_CONST               1 (1)
             19 PRINT_ITEM          
             20 PRINT_NEWLINE       

  5          21 LOAD_FAST                0 (x)
             24 LOAD_GLOBAL              0 (fact)
             27 LOAD_FAST                0 (x)
             30 LOAD_CONST               1 (1)
             33 BINARY_SUBTRACT     
             34 CALL_FUNCTION            1
             37 BINARY_MULTIPLY     
             38 RETURN_VALUE        

So, what do we do? We compare their raw byte code to see if they represent the same functionality, as we did previously.

As you can understand this turns out to be a recursive procedure where we first compare the raw byte code for the input, then scan through co_consts to see if another code object is present and repeat until no code objects can be found, if the code objects are in a different position in co_consts we'll return False.

In code, that should look like this:

from types import CodeType 

def equal_code(c1, c2):
    if c1.co_code != c2.co_code:
        return False
    for i, j in zip(c1.co_consts, c2.co_consts):
        if isinstance(i, CodeType):
            if isinstance(j, CodeType):
                return equal_code(i, j)
            else: # consts in different position, return false
                return False
    return True

Where CodeType from types is used to check for code instances.

I think this is pretty much the best you can do using only code objects generated from compile.

Upvotes: 3

Useless
Useless

Reputation: 67802

So many questions...

Can compile object be used to detect changes in the Python source code?

Yes, sort of. However, since code objects contain the names of their local variables, renaming x to y will make your comparison fail even though there's no functional change.


How do I know that code has changed? (Is it the kind of change that requires executing the code again or a simple space addition.)

It's worth mentioning that a "simple space addition" may require re-compilation and re-execution in Python, more so than in many other languages.

I do think their are other way to achieve what I a trying to achieve by using hashed value or something like this but I want to do it this way because it seems more Pythonic and better than reinventing the wheel.

I'm not sure what's Pythonic about this particular option - maybe it makes your code a lot simpler? That would be a perfectly good reason to choose it.

Otherwise, string comparison is probably faster (with the same sensitivity to variable renaming), and a full AST comparison is more complex but potentially smarter.


Finally:

Every time an user executes his/her code its executed on the server. Even adding an space/bracket etc results into a new execution. I am trying to optimize this step by storing the compiled code and if the code is same in new request then not executing it.

But you probably should execute the user's code when they explicitly ask you to.

If you were doing this every time they typed a character, it would obviously pointless, but consider code using random numbers: the user would reasonably expect to see the output change when they hit execute, even with no code change.

Upvotes: 1

Burhan Khalid
Burhan Khalid

Reputation: 174690

There are many ways to do this that are a lot simpler than what you are trying:

  1. Use diff, with the -w (ignore whitespace) flag. If there is a resulting difference, you will know it is highly likely that it is a code change.

  2. Use git or other source code repository. Then find out if there is a change done between files before deciding to execute. However, in this respect you are just using the diff-ing capabilities of git, so might as well go with the first option.

Upvotes: 2

Related Questions