Reputation: 1905
I'm considering using hadoop/mapreduce to tackle a project and haven't quite figured out how to set up a job flow consisting of a variable number of levels that should be processed in sequence.
E.g.:
Job 1: Map source data into X levels.
Job 2: MapReduce Level1 -> appends to Level2
Job 3: MapReduce Level2 -> appends to Level3
Job N: MapReduce LevelN -> appends to LevelN+1
And so on until the final level. The key is that each level must include its own specific source data as well as the results of the previous level.
I've looked at Pig, Hive, Hamake, and Cascading, but have yet to see clear support for something like this.
Does anyone know an efficient way of accomplishing this? Right now I'm leaning towards writing a wrapper for Hamake that will generate the Hamake file based on parameters (the number of levels is known at runtime but could change with each run).
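To make the flow concrete, this is roughly the chain I want, expressed as a plain Java driver (a minimal sketch only; LevelMapper, LevelReducer, and the /data paths are hypothetical, and most job setup is elided):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LevelChainDriver {
    public static void main(String[] args) throws Exception {
        int levels = Integer.parseInt(args[0]); // known at runtime, varies per run
        Configuration conf = new Configuration();

        Path previous = new Path("/data/level1"); // output of Job 1 (the initial map)
        for (int level = 2; level <= levels; level++) {
            Job job = new Job(conf, "level-" + level);
            job.setJarByClass(LevelChainDriver.class);
            job.setMapperClass(LevelMapper.class);   // hypothetical
            job.setReducerClass(LevelReducer.class); // hypothetical
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            // Each level reads its own source data plus the previous level's results.
            FileInputFormat.addInputPath(job, new Path("/data/source/level" + level));
            FileInputFormat.addInputPath(job, previous);

            Path out = new Path("/data/level" + level);
            FileOutputFormat.setOutputPath(job, out);

            if (!job.waitForCompletion(true)) {
                System.exit(1); // abort the chain if a level fails
            }
            previous = out; // this level's results feed the next level
        }
    }
}
```

Since Hadoop won't append to an existing output directory, each level writes to its own directory, and "append" really means feeding both that directory and the level's own source data into the next job's input.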
Thanks!
Upvotes: 1
Views: 503
Reputation: 19666
You should be able to generate the Pig code for this pretty easily using Piglet, the Ruby Pig DSL: http://github.com/iconara/piglet
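Since the number of levels is only known at run time, the whole trick is just emitting the script in a loop; here's a rough sketch of that idea in plain Java (hypothetical paths and a placeholder per-level transform; Piglet lets you write the same loop in Ruby instead of building strings):

```java
import java.io.PrintWriter;

public class PigScriptGenerator {
    public static void main(String[] args) throws Exception {
        int levels = Integer.parseInt(args[0]);
        try (PrintWriter out = new PrintWriter("levels.pig")) {
            out.println("prev = LOAD '/data/level1';");
            for (int level = 2; level <= levels; level++) {
                // Each level unions its own source data with the previous level's results.
                out.printf("src%d = LOAD '/data/source/level%d';%n", level, level);
                out.printf("in%d = UNION prev, src%d;%n", level, level);
                out.printf("prev = FOREACH in%d GENERATE *; -- placeholder transform%n", level);
                out.printf("STORE prev INTO '/data/level%d';%n", level);
            }
        }
    }
}
```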
Upvotes: 0
Reputation: 1275
Oozie (http://yahoo.github.com/oozie/) is an open-source workflow server that Yahoo released to manage Hadoop and Pig job flows like the one you're describing.
Cloudera includes it in their latest distribution, with very good documentation: https://wiki.cloudera.com/display/DOC/Oozie+Installation
Here is a video from Yahoo: http://sg.video.yahoo.com/watch/5936767/15449686
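To give a rough idea of what the workflow definition looks like, here is a sketch of a workflow.xml with one map-reduce action per level (only level 2 shown; the names, paths, and property values are hypothetical, and the mapper/reducer properties are elided):

```xml
<workflow-app name="level-chain" xmlns="uri:oozie:workflow:0.1">
  <start to="level2"/>
  <action name="level2">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <!-- each level reads its own source data plus the previous level's output -->
        <property>
          <name>mapred.input.dir</name>
          <value>/data/level1,/data/source/level2</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>/data/level2</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/> <!-- in a generated workflow this would point at the next level's action -->
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>level job failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

Since the number of levels varies per run, you would generate this file from your parameters rather than writing it by hand.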
Upvotes: 1