Martin Thoma

Reputation: 136665

How can I check if a loaded Python function changed?

As a data scientist / machine learning developer, I almost always have a load_data function. Executing it often takes more than 5 minutes, because the operations it performs are expensive. When I instead store the end result of load_data in a pickle file and read that file back, the time often drops to a few seconds.

So a solution I use quite often is:

import os

import mpu.io


def load_data(serialize_pickle_path, original_filepath):
    invalid_hash = True

    if os.path.exists(serialize_pickle_path):
        content = mpu.io.read(serialize_pickle_path)
        data = content['data']
        invalid_hash = mpu.io.hash(original_filepath) != content['hash']

    if invalid_hash:
        # load_data_initial() is the expensive function that actually builds the data.
        data = load_data_initial()
        filehash = mpu.io.hash(original_filepath)
        mpu.io.write(serialize_pickle_path, {'data': data, 'hash': filehash})

    return data

This solution has a major drawback: if load_data_initial changes, the cached file is still treated as valid and the data will not be regenerated.

Is there a way to check for changes in Python functions?

Upvotes: 2

Views: 943

Answers (1)

abarnert

Reputation: 365975

Assuming you're asking whether there's a way to tell whether someone changed the source code of the function between the last time you quit the program and the next time you start it…

There's no way to do this directly, but it's not that hard to do manually, if you don't mind getting a little hacky.

Since you've imported the module and have access to the function, you can use the getsource function to get its source code. So, all you need to do is save that source. For example:

import inspect

def source_match(source_path, obj):
    try:
        with open(source_path) as f:
            source = f.read()
        if source == inspect.getsource(obj):
            return True
            return True
    except Exception as e:
        # Maybe log e or something here, but any of the obvious problems,
        # like the file not existing or the function not being inspectable,
        # mean we have to re-generate the data
        pass
    return False

def load_data(serialize_pickle_path, original_filepath):
    invalid_hash = True
    if os.path.exists(serialize_pickle_path):
        if source_match(serialize_pickle_path + '.sourcepy', load_data_initial):
            content = mpu.io.read(serialize_pickle_path)
            data = content['data']
            invalid_hash = mpu.io.hash(original_filepath) != content['hash']
    # etc., but make sure to save the source when you save the pickle too
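
For example, the elided half of the function might continue along these lines; this is just a sketch that mirrors the question's original code, plus one extra write that saves the source of load_data_initial next to the pickle:

    if invalid_hash:
        data = load_data_initial()
        filehash = mpu.io.hash(original_filepath)
        mpu.io.write(serialize_pickle_path, {'data': data, 'hash': filehash})
        # Save the current source as well, so the next run can compare against it.
        with open(serialize_pickle_path + '.sourcepy', 'w') as f:
            f.write(inspect.getsource(load_data_initial))
    return data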

Of course even if the body of the function hasn't changed, its effect might change because of, e.g., a change in some module constant, or the implementation of some other function it uses. Depending on how much this matters, you could pull in the entire module it's defined in, or that module plus every other module that it recursively depends on, etc.
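
For instance, a crude way to cover the whole defining module rather than just the one function is to hash its full source; module_source_hash here is a hypothetical helper name:

import hashlib
import inspect
import sys

def module_source_hash(func):
    # Hash the source of the module that defines func, so that changes to
    # module-level constants or helper functions also invalidate the cache.
    module = sys.modules[func.__module__]
    return hashlib.sha256(inspect.getsource(module).encode('utf-8')).hexdigest()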

And of course you can also save hashes of text instead of the full text, to make things a little smaller. Or embed them in the pickle file instead of saving them alongside.
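
For example, instead of a separate .sourcepy file, you could store a hash of the function's source inside the pickle itself and compare it on load; the source_hash helper and the 'source_hash' key are made up for this sketch:

import hashlib
import inspect

def source_hash(func):
    # A hash of the function's source text; smaller than storing the text itself.
    return hashlib.sha256(inspect.getsource(func).encode('utf-8')).hexdigest()

# When writing the cache:
#     mpu.io.write(path, {'data': data, 'hash': filehash,
#                         'source_hash': source_hash(load_data_initial)})
# When reading it back, also require:
#     content['source_hash'] == source_hash(load_data_initial)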

Also, if the source isn't available because the function comes from a module you only distribute in .pyc format, you obviously can't check the source. You could check its __code__ attribute instead (pickling the function itself won't help, because pickle only stores a reference to it by name). But if the function comes from a C extension module, even that won't work. At that point, the best you can do is check the timestamp or hash of the whole binary file.
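
For example, when the source isn't available but the function is still pure Python, you could hash its compiled code object instead; note that marshal output is only stable within one Python version, so treat such a cache as per-interpreter. A minimal sketch:

import hashlib
import marshal

def code_hash(func):
    # Serialize the code object (the same mechanism .pyc files use) and hash it.
    # This catches changes to bytecode and constants, but not changes in other
    # functions that func calls.
    return hashlib.sha256(marshal.dumps(func.__code__)).hexdigest()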

And plenty of other variations. But this should be enough to get you started.


A completely different alternative is to do the checking as part of your development workflow, instead of as part of the program.

Assuming you're using some kind of version control (if not, you really should be), most systems come with some kind of commit-hook mechanism. git, for example, comes with a whole slew of options: if you have an executable named .git/hooks/pre-commit, it will get run every time you try to git commit.

Anyway, the simplest pre-commit hook would be something like (untested):

#!/bin/sh
# Check the staged files; the trailing exit 0 keeps a non-match from aborting the commit.
git diff --cached --name-only | grep -q module_with_load_function.py && python /path/to/pickle/cleanup/script.py
exit 0

Now, every time you do a git commit, if the staged changes include any change to a file named module_with_load_function.py (obviously use the name of the file with load_data_initial in it), it will first run the script /path/to/pickle/cleanup/script.py (which is a script you write that just deletes all the cached pickle files).

If you've edited the file but know you don't need to clean out the pickles, you can just git commit --no-verify. Or you can expand on the script to have an environment variable that you can use to skip the cleaning, or to only clean certain directories, or whatever you want. (It's probably better to default to cleaning overzealously—worst-case scenario, when you forget every few weeks, you waste 5 minutes waiting, which isn't as bad as waiting 3 hours for it to run a bunch of processing on incorrect data, right?)

You can expand on this to, e.g., check the complete diffs and see if they include the function, instead of just checking the filenames. The hooks are just anything executable, so you can write them in Python instead of bash if they get too complicated.
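
For example, a Python version of the hook that inspects the staged diff itself (rather than just the filenames) could look roughly like this; matching on the bare function name is deliberately crude:

#!/usr/bin/env python
# .git/hooks/pre-commit (sketch): clean the cached pickles only if the staged
# diff actually touches load_data_initial.
import subprocess
import sys

diff = subprocess.run(
    ['git', 'diff', '--cached'],
    capture_output=True, text=True, check=True,
).stdout

if 'load_data_initial' in diff:
    # Reuse the same cleanup script as the shell version above.
    subprocess.run([sys.executable, '/path/to/pickle/cleanup/script.py'],
                   check=True)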

If you don't know git all that well (or even if you do), you'll probably be happier installing a third-party tool like pre-commit that makes it easier to manage hooks, write them in Python (without having to deal with complicated git commands), etc. If you are comfortable with git, just looking at hooks/pre-commit.sample and some of the other samples in git's template directory should be enough to give you ideas.

Upvotes: 6
