hp36

Reputation: 279

Inserting multiple rows into HBase using MapReduce

I want to insert N rows into an HBase table from each mapper, in batches. I am currently aware of two ways of doing this:

  1. Create a list of Put objects and use the put(List<Put> puts) method of an HTable instance, making sure to disable the autoFlush parameter.
  2. Use the TableOutputFormat class and the context.write(rowKey, put) method.

Which one is better?

In the 1st approach, context.write() is not required, since hTable.put(putsList) is used to put the data into the table directly. My mapper class extends Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>, so what classes should I use for KEYOUT and VALUEOUT? Roughly, I mean something like the sketch below.
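A rough sketch of what I mean by the 1st approach (the table name, column family, and batch size are placeholders, I am assuming the HBase 1.x HTable API, and NullWritable for KEYOUT/VALUEOUT is only my guess, since that is exactly what I am asking about):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DirectPutMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  private static final int N = 1000;   // placeholder: rows per batch
  private HTable table;
  private final List<Put> puts = new ArrayList<>();

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = HBaseConfiguration.create(context.getConfiguration());
    table = new HTable(conf, "my_table");   // placeholder table name
    table.setAutoFlush(false);              // disable autoFlush, rely on the client write buffer
  }

  @Override
  protected void map(LongWritable key, Text value, Context context) throws IOException {
    Put put = new Put(Bytes.toBytes(value.toString()));
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v"));   // placeholder column
    puts.add(put);
    if (puts.size() >= N) {   // send a batch of N Puts
      table.put(puts);
      puts.clear();
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    if (!puts.isEmpty()) {
      table.put(puts);        // send the remainder
    }
    table.flushCommits();     // make sure everything buffered is written
    table.close();
  }
}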

In the 2nd approach, I have to call context.write(rowKey, put) N times, roughly as in the sketch below. Is there any way to use context.write() for a list of Put operations?
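And a rough sketch of the 2nd approach, where each of the N Puts goes through its own context.write() call (again, N, the row keys, and the column names are placeholders):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TableOutputMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  private static final int N = 10;   // placeholder: rows derived per input record

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (int i = 0; i < N; i++) {
      byte[] rowKey = Bytes.toBytes(value.toString() + "_" + i);
      Put put = new Put(rowKey);
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v"));   // placeholder column
      context.write(new ImmutableBytesWritable(rowKey), put);   // one write per Put
    }
  }
}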

Is there any other way of doing this with MapReduce?

Thanks in advance.

Upvotes: 2

Views: 1668

Answers (1)

Ram Ghadiyaram

Reputation: 29227

I prefer the second option, where batching is natural for MapReduce (no need for a list of Puts). For deeper insight, please see my second point.

1) Your first option, List<Put>, is generally used with a standalone HBase Java client. Internally, the buffering is controlled by hbase.client.write.buffer, configured like below in one of your config XMLs:

<property>
  <name>hbase.client.write.buffer</name>
  <value>20971520</value> <!-- 20 MB here; the default is 2 MB (2097152) -->
</property>

The default value is 2 MB. Once your buffer is filled, all buffered Puts are flushed and actually inserted into your table. This is the same mechanism BufferedMutator uses, as explained in point 2.
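A rough standalone-client sketch of that buffering, assuming the HBase 1.x Connection/BufferedMutator API (table and column names are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WriteBufferExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.setLong("hbase.client.write.buffer", 2 * 1024 * 1024);   // 2 MB buffer

    try (Connection connection = ConnectionFactory.createConnection(conf);
         BufferedMutator mutator = connection.getBufferedMutator(TableName.valueOf("my_table"))) {
      for (int i = 0; i < 100000; i++) {
        Put put = new Put(Bytes.toBytes("row-" + i));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v"));
        mutator.mutate(put);   // buffered; flushed automatically once the buffer fills
      }
      mutator.flush();         // flush whatever is left at the end
    }
  }
}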

2) Regarding the second option: if you look at the TableOutputFormat documentation,

org.apache.hadoop.hbase.mapreduce
Class TableOutputFormat<KEY>

java.lang.Object
org.apache.hadoop.mapreduce.OutputFormat<KEY,Mutation>
org.apache.hadoop.hbase.mapreduce.TableOutputFormat<KEY>
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable

@InterfaceAudience.Public
@InterfaceStability.Stable
public class TableOutputFormat<KEY>
extends org.apache.hadoop.mapreduce.OutputFormat<KEY,Mutation>
implements org.apache.hadoop.conf.Configurable
Convert Map/Reduce output and write it to an HBase table. The KEY is ignored while the output value must be either a Put or a Delete instance.

Another way of seeing this is through the code of its record writer:

/**
 * Writes a key/value pair into the table.
 *
 * @param key  The key.
 * @param value  The value.
 * @throws IOException When writing fails.
 * @see RecordWriter#write(Object, Object)
 */
@Override
public void write(KEY key, Mutation value) throws IOException {
  if (!(value instanceof Put) && !(value instanceof Delete)) {
    throw new IOException("Pass a Delete or a Put");
  }
  mutator.mutate(value);
}

Conclusion: context.write(rowkey, putList) is not possible with this API; each Put needs its own context.write(rowkey, put) call.

However, the BufferedMutator documentation (the mutator in the code above is a BufferedMutator) says:

Map/reduce jobs benefit from batching, but have no natural flush point. BufferedMutator receives the puts from the M/R job and will batch puts based on some heuristic, such as the accumulated size of the puts, and submit batches of puts asynchronously so that the M/R logic can continue without interruption.

So your batching is natural (with BufferedMutator), as mentioned above.
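To get that natural batching you only need to wire your job to TableOutputFormat, roughly like this (table name, mapper class, and input path are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class PutJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.set(TableOutputFormat.OUTPUT_TABLE, "my_table");   // placeholder target table

    Job job = Job.getInstance(conf, "bulk-puts");
    job.setJarByClass(PutJobDriver.class);
    job.setMapperClass(TableOutputMapper.class);            // e.g. the mapper sketched in the question
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));

    job.setOutputFormatClass(TableOutputFormat.class);
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(Put.class);
    job.setNumReduceTasks(0);   // map-only: every context.write() goes to the BufferedMutator

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

TableMapReduceUtil.initTableReducerJob("my_table", null, job) is another common way to do the same wiring; it sets the output format and output table for you.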

Upvotes: 1
