Reputation: 189
Could someone please explain to me what is the difference between Hadoop Streaming vs Buffering?
Here is the context I have read in Hive :
In every map/reduce stage of the join, the last table in the sequence is streamed through the reducers whereas the others are buffered. Therefore, it helps to reduce the memory needed in the reducer for buffering the rows for a particular value of the join key by organizing the tables such that the largest tables appear last in the sequence. e.g. in:
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
Upvotes: 2
Views: 2197
Reputation: 321
In a reduce side join, the values from multiple tables are often tagged to identify them on reducer stage, for the table they are coming from.
Consider a case of two tables:
On reduce call, the mixed values associated with both tables are iterated.
During iteration, the value for one of the tag/table are locally stored into an arraylist. (This is buffering).
While the rest of the values are being streamed through and values for another tag/table are detected, the values of first tag are fetched from the saved arraylist. The two tag values are joined and written to output collector.
Contrast this with the case what if the larger table values are kept in arraylist then it could result into OOM if the arraylist outgrows to overwhelm the memory of the container's JVM.
void reduce(TextPair key , Iterator <TextPair> values ,OutputCollector <Text,Text> output ,Reporter reporter ) throws IOException {
//buffer for table1
ArrayList <Text> table1Values = new ArrayList <Text>() ;
//table1 tag
Text table1Tag = key . getSecond();
TextPair value = null;
while( values . hasNext() ){
value = values . next() ;
if(value.getSecond().equals(table1Tag)){
table1Values.add (value.getFirst() );
}
else{
for( Text val : table1Values ){
output.collect ( key.getFirst() ,new Text(val.toString() + "\t"+ value.getFirst().toString () ));
}
}
}
}
You can use the below hint to specify which of the joined tables would be streamed on reduce side:
SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
Upvotes: 3
Reputation: 1630
Hadoop Streaming in general refers to using custom made python or shell scripts to perform your map-reduce logic. ( For example, using the Hive TRANSFORM keyword.)
Hadoop buffering, in this context, refers to the phase in a map-reduce job of a Hive query with a join, when records are read into the reducers, after having been sorted and grouped coming out of the mappers. The author is explaining why you should order the join clauses i n a Hive query, so that the largest tables are last; because it helps optimize the implementation of joins in Hive.
They are completely different concepts.
In response to your comments:
In Hive's join implementation, it must take records from multiple tables, sort them by the join key, and then collate them together in the proper order. It has to read them grouped by the different tables, so they have to see groups from different tables, and once all tables have been seen, start processing them. The first groups from the first tables need to be buffered (kept in memory) because they can not be processed until the last table is seen. The last table can be streamed, (each row processed as they are read) since the other tables group are in memory, and the join can start.
Upvotes: 2