Reputation: 643
We have a significant (~50 kloc) tree of packages/modules (approx. 2,200 files) that we ship around to our cluster with each job. The jobs run for ~12 hours, so the overhead of untarring/bootstrapping (i.e. resolving PYTHONPATH for each module) usually isn't a big deal. However, as the number of cores in our worker nodes has increased, we've increasingly hit the case where the scheduler lands 12 jobs simultaneously, which grinds the poor scratch drive to a halt servicing all the requests (worse, for reasons beyond our control, each job requires a separate loopback filesystem, so there are two layers of indirection on the drive).
Is there a way to hint the proper location of each file to the interpreter without strewing paths throughout the code (maybe by overriding import?), or to bundle up all the associated .pyc files into some sort of binary blob that can be read just once?
Thanks!
Upvotes: 1
Views: 493
Reputation: 15305
We had problems like this on our cluster. (The Lustre filesystem was slow for metadata operations.) Our solution was to use the "zip import" facilities in Python.
In our case we made a single zip of the stdlib, placed at the name already on sys.path (e.g. "/usr/lib/python26.zip" — the interpreter puts that path on sys.path by default, so simply creating the file there is enough for it to be picked up), and another zip of our project, with the latter added to the PYTHONPATH.
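For what it's worth, a minimal sketch of how such a project bundle could be built (assuming Python 3 here; the "ourproject" name and paths are placeholders, and on the 2.x of that era you'd drop legacy=True, since compileall already wrote foo.pyc next to foo.py):

    import compileall
    import os
    import zipfile

    SRC = "/path/to/ourproject"        # hypothetical package directory
    BUNDLE = "/path/to/ourproject.zip" # hypothetical bundle location

    # Pre-compile so the bundle carries .pyc files and workers never compile.
    # legacy=True writes foo.pyc beside foo.py, the layout zipimport expects.
    compileall.compile_dir(SRC, quiet=1, legacy=True)

    with zipfile.ZipFile(BUNDLE, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, dirs, files in os.walk(SRC):
            dirs[:] = [d for d in dirs if d != "__pycache__"]
            for name in files:
                if name.endswith((".py", ".pyc")):
                    path = os.path.join(root, name)
                    # Archive names must be package-relative for import to work.
                    zf.write(path, os.path.relpath(path, os.path.dirname(SRC)))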
This is much faster because importing becomes a single filesystem metadata read, followed by one quick read of the zip's table of contents to find out what's inside, which is then cached for later lookups.
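At job start-up nothing special is needed beyond putting the bundle on the path, either via PYTHONPATH before launch or in a small bootstrap like this hypothetical one (module names are placeholders):

    import sys

    # One stat() on the zip replaces thousands of per-module path probes.
    sys.path.insert(0, "/path/to/ourproject.zip")

    import ourproject.somemodule  # served by zipimport from the single file
    print(ourproject.somemodule.__file__)
    # -> /path/to/ourproject.zip/ourproject/somemodule.pyc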
Upvotes: 3