Reputation: 7169
I am relatively new to distributed computing, so forgive me if I misunderstand some of the basic concepts here. I am looking for a (preferably) Python-based alternative to Hadoop for processing large data sets via MapReduce on a cluster that uses an SGE-based grid engine (e.g. Open Grid Scheduler or Sun Grid Engine). I have had good luck running basic distributed jobs with PythonGrid, but I'd really like a more feature-rich framework for running my jobs.

I have read up on tools like Disco and MinceMeatPy, both of which seem to offer true Map-Sort-Reduce job processing, but there does not seem to be any obvious support for SGE in either. This makes me wonder whether it is possible to achieve true MapReduce functionality through a grid scheduler at all, or whether the tools just don't support it out-of-the-box because grid engines are not commonly used for this.

Can you perform Map-Sort-Reduce tasks on a grid engine? Are there Python tools that support this? How difficult would it be to rig existing MapReduce tools to use SGE job schedulers?
Upvotes: 2
Views: 1087
Reputation: 590
I've heard that Jug works. It uses the filesystem for coordination among the parallel tasks. With that kind of framework, you write your code, run "jug status primes.py" on the machine you're on to check progress, and then start a grid array job with as many workers as you like, all running "jug execute primes.py".
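For reference, a minimal Jug script looks roughly like the primes example from its documentation (the numbers and names below are just illustrative):

```python
# primes.py -- minimal Jug script, adapted from the primes example in
# Jug's documentation; the numbers here are illustrative.
from jug import TaskGenerator

@TaskGenerator
def is_prime(n):
    # Naive primality test; each call becomes one Jug task, coordinated
    # through files that Jug writes to the shared filesystem.
    for j in range(2, n):
        if n % j == 0:
            return False
    return True

primes_up_to_100 = [is_prime(n) for n in range(2, 101)]
```

Each worker then just runs "jug execute primes.py", so an SGE job script whose body is that single line, submitted with something like "qsub -t 1-20", gives you 20 workers chewing through the tasks, while "jug status primes.py" on the head node shows progress.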
mincemeat.py should be able to work the same way, but it looks to use the network for coordination instead, so it may depend on whether your compute nodes can talk back to a server running the overall script.
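For context, mincemeat.py's usage is roughly as follows (this is reconstructed from its published word-count example, so treat the exact API details as assumptions): a server process defines the map/reduce functions and the data source, and each node runs the mincemeat.py client pointed at the server.

```python
# server.py -- word-count sketch in the style of mincemeat.py's example;
# the exact API details (Server attributes, run_server signature) are assumptions.
import mincemeat

data = ["the quick brown fox", "jumps over the lazy dog"]

def mapfn(key, value):
    for word in value.split():
        yield word, 1

def reducefn(key, values):
    return sum(values)

server = mincemeat.Server()
server.datasource = dict(enumerate(data))
server.mapfn = mapfn
server.reducefn = reducefn

# Workers connect back to this process over TCP, which is why the
# compute nodes need network access to wherever this script runs.
results = server.run_server(password="changeme")
print(results)
```

The grid array job would then just run the mincemeat.py client on each node, pointed at the machine running this server script.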
There are several release notes about running actual Hadoop MapReduce and HDFS on SGE, but I haven't been able to find good documentation.
If you're used to Hadoop streaming with Python, it's not too bad to replicate on SGE. I've had some success with this at work: one array job runs map + shuffle for each input file, and a second array job runs sort + reduce for each reducer number. The shuffle part just writes files to a network directory with names like mapper00000_reducer00000, mapper00000_reducer00001, and so on (one file per mapper/reducer pair). Then reducer 00001 sorts all the files labeled reducer00001 together and pipes them into the reduce code.
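As a rough sketch of what those two array tasks can look like (the file layout matches the description above, but the function names, the stable-hash partitioner, and the use of SGE_TASK_ID to pick the input file or reducer number are my own choices, not a standard):

```python
# mr_task.py -- one script used by both array jobs:
#   qsub -t 1-<num inputs>   ... python mr_task.py map /shared/input
#   qsub -t 1-<num reducers> ... python mr_task.py reduce
import glob
import os
import sys
import zlib
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 4
SHUFFLE_DIR = "/shared/shuffle"  # network directory visible to all nodes

def mapper(line):
    # Example map function: word count.
    for word in line.split():
        yield word, 1

def reducer(key, values):
    # Example reduce function: sum the counts.
    return key, sum(int(v) for v in values)

def map_and_shuffle(input_path, mapper_id):
    # Array job 1: map one input file and partition its output by reducer.
    # zlib.crc32 is used instead of hash() so every mapper process sends
    # the same key to the same reducer.
    outs = [open(os.path.join(SHUFFLE_DIR,
                              "mapper%05d_reducer%05d" % (mapper_id, r)), "w")
            for r in range(NUM_REDUCERS)]
    with open(input_path) as f:
        for line in f:
            for key, value in mapper(line):
                r = zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS
                outs[r].write("%s\t%s\n" % (key, value))
    for out in outs:
        out.close()

def sort_and_reduce(reducer_id):
    # Array job 2: gather every file for this reducer, sort by key,
    # group, and run the reduce function on each group.
    records = []
    for path in glob.glob(os.path.join(SHUFFLE_DIR,
                                       "mapper*_reducer%05d" % reducer_id)):
        with open(path) as f:
            for line in f:
                key, value = line.rstrip("\n").split("\t", 1)
                records.append((key, value))
    records.sort(key=itemgetter(0))
    for key, group in groupby(records, key=itemgetter(0)):
        print("%s\t%s" % reducer(key, (v for _, v in group)))

if __name__ == "__main__":
    # SGE sets SGE_TASK_ID (1-based) for each element of a "qsub -t" array job.
    task_id = int(os.environ["SGE_TASK_ID"]) - 1
    if sys.argv[1] == "map":
        inputs = sorted(glob.glob(os.path.join(sys.argv[2], "*")))
        map_and_shuffle(inputs[task_id], task_id)
    else:
        sort_and_reduce(task_id)
```

The second array job can be held on the first with qsub's -hold_jid option, so the sort/reduce tasks don't start until all of the shuffle files exist.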
Unfortunately, Hadoop streaming isn't very full-featured.
Upvotes: 2