BenDundee

Reputation: 4521

Deploying Python analysis scripts to a server

I'm the data guy at a small startup in Austin. All of the analysis that I do (thus far) is stored as a set of ad hoc scripts that I run from my laptop. This is a bad idea.

I'm going to sketch my plans here for deploying my analysis, and I'd like to know if there are any glaring holes I've missed, or anything else I should be considering. I think the outline I have keeps things atomic enough that I can plug things in when and where I need to, but still lets me run a single script pretty easily. A secondary (longer-term) goal is to put up a simple web front end that lets a user (i.e., an employee at my company) invoke scripts one at a time; see here: Very simple web service: take inputs, email results.

I want to deploy the scripts to a server, and I'm thinking of organizing my scripts into a set of python modules. Then, I want my batch script to look like:

import analysis

batch_dict = analysis.build_batch_dict()
assert sorted(batch_dict.keys()) == ['Hourly', 'Monthly', 'Nightly', 'Weekly']

scripts_to_run = analysis.what_batch() # get from command line?
results_directory = analysis.make_results_directory()
failures = {}
for script in scripts_to_run:
    try:
        script.analyze()
        script.export_results(results_directory)
    except Exception as e:
        failures.update(script.failed(e))

analysis.completed(failures)

I'll rewrite my analyses so that they are handled by a class.

class AnalysisHandler(object):
    ...
    def analyze(self):
        pass
    def export_results(self, some_directory):
        pass
    def failed(self, exception):
        pass
    def run_with_non_default_args(self, *args, **kwargs):
        pass
    def something_else_im_missing_now(self):
        pass

All of the scripts will be handled by something that inherits from AnalysisHandler.
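To make the contract concrete, here is a minimal sketch of one such subclass. The class name `RevenueAnalysis`, its placeholder computation, and the output file name are all made up for illustration; only the method names come from the plan above.

```python
import os

class AnalysisHandler(object):
    def analyze(self):
        raise NotImplementedError
    def export_results(self, some_directory):
        raise NotImplementedError
    def failed(self, exception):
        # Map the script's name to the exception so batch.py can
        # merge it into its failures dict with failures.update(...).
        return {self.__class__.__name__: exception}

class RevenueAnalysis(AnalysisHandler):
    """Hypothetical concrete analysis."""
    def analyze(self):
        self.results = {'total': 42}  # placeholder computation
    def export_results(self, some_directory):
        path = os.path.join(some_directory, 'revenue.txt')
        with open(path, 'w') as f:
            f.write(str(self.results))
```

Raising `NotImplementedError` in the base class (rather than `pass`) makes a half-finished subclass fail loudly instead of silently doing nothing.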

The directory structure on the server will look like:

datalytics/
    results/
        02-14-2013/
            script1/
                log/
                error/
                data/
            script2/
                log/
                error/
                data/
            .
            .
            .
            <etc>
    scripts/
        script1/
            bin/
            data/
            doc/
            script_1/
            tests/
            setup.py
        .
        .
        .
    analysis/
        __init__.py
        analysis.py
    batch.py (see above)
    new_script.py
    run_all_tests.py
    run_some_tests.py
    run_this_script.py
    run_everything.py

Upvotes: 0

Views: 167

Answers (1)

pyInTheSky

Reputation: 1469

As a formal answer:

  • Watch out for, and try to catch, exceptions when creating files (you may get permission issues), especially once you add a front end and may want to write to a directory an employee can't modify

  • With your current setup, I'd suggest wrapping the loop in a 'with' statement, keeping an open file handle, and flushing results as they come in. This lets you track progress to some extent, and if your server crashes it tells you whether your tests were running, and possibly whether one of them caused the crash

  • You seem to be developing quite a bit of framework. While the unittest module is meant to test Python code, it can certainly be leveraged to replace a lot of your framework (i.e. sorting tests, specifying which tests to run, logging, etc.). It will also give you an easy way to attach an interface, and to mark tests for expected failures and the like.
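A sketch of leaning on unittest this way. The analysis classes are stand-ins; the point is that batch selection, expected-failure marking, and result reporting come for free:

```python
import unittest

class HourlyAnalysis(unittest.TestCase):
    def test_analyze(self):
        self.assertEqual(2 + 2, 4)  # placeholder assertion

class NightlyAnalysis(unittest.TestCase):
    @unittest.expectedFailure
    def test_known_broken(self):
        self.assertTrue(False)

def run_selected(classes):
    """Run only the chosen 'batches' by loading just those classes."""
    loader = unittest.TestLoader()
    suite = unittest.TestSuite(loader.loadTestsFromTestCase(c)
                               for c in classes)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```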

  • As important as it is to have output, it's also important to make it useful. You may want it printed as clean text, but if it started out as, say, a Python dictionary, dump the dict as a string (one per line in your log file) so you can scoop it right back into Python and manipulate the data if need be.

  • Going off the last point, take a look at json.dumps and json.loads, especially for working with logs and a web UI. JavaScript is friendly with that format, and you can save yourself a whole lot of work by keeping everything in a 'python-happy' format

  • Threading the tests isn't hard to add if you need it. If you know one task takes a long time to run, or is I/O intensive, maybe let it spin off on its own. If you do think you're going to go that route, plan for it from the beginning and be aware of variables you may need to protect, or push all your results onto a queue instead of into a dictionary to head off race conditions
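A sketch of the queue approach: each worker pushes a result tuple onto a thread-safe Queue instead of writing into a shared dict, so no locking is needed. The script names and callables are made up for the example:

```python
import threading
try:
    import queue           # Python 3
except ImportError:
    import Queue as queue  # Python 2

def run_threaded(scripts):
    """Run (name, func) pairs in threads; collect results via a Queue."""
    results = queue.Queue()  # thread-safe: workers never race on it
    def worker(name, func):
        try:
            results.put((name, 'ok', func()))
        except Exception as e:
            results.put((name, 'error', e))
    threads = [threading.Thread(target=worker, args=(n, f))
               for n, f in scripts]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Drain the queue into a plain dict once all threads are done.
    out = {}
    while not results.empty():
        name, status, val = results.get()
        out[name] = (status, val)
    return out
```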

  • Be aware of the resolution of timestamps, especially if you are loading a dictionary that uses a timestamp as a key o_O ---> just don't do it

  • I noticed the config function accepts *args and **kwargs. If you are going to allow configuring of tests, via UI or via file, use a JSON-friendly format, especially if it's a file: you can do kwargs = json.loads(open(configfile, 'r').read()) and be super happy that you didn't have to write a parser or regex, or modify your code when you add a parameter.

Upvotes: 1
