Reputation: 5397
I was trying to insert data into HBase using the following commands:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,f:pageviews,f:visit -Dimporttsv.separator=\001 -Dimporttsv.bulk.output=output modelvar /000000.gz
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles modelvar output
where modelvar is the final HBase table in which the data is supposed to be stored, and output is the HDFS path where the HFiles are written. The problem is that the data I am trying to insert is the output of Hive, so the separator is Hive's default, \001, which I can't change. I therefore set the -Dimporttsv.separator= value to \001, but typed literally that is a multi-character string, and ImportTsv doesn't accept a multi-character separator. So how do I insert data written by Hive into HBase?
Upvotes: 3
Views: 2701
Reputation: 11
Finally found an answer: pass the separator as $(echo -e "\002"). This works for all shell commands.
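Applied to the command from the question, the same trick passes Hive's default \001 field delimiter as a single byte (a sketch reusing the question's table name, columns, and paths; quoting the whole -D option keeps the control byte as one argument):

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
    -Dimporttsv.columns=HBASE_ROW_KEY,f:pageviews,f:visit \
    "-Dimporttsv.separator=$(echo -e '\001')" \
    -Dimporttsv.bulk.output=output modelvar /000000.gz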
Upvotes: 1
Reputation: 161
IMO, you can't put a raw byte in the Hadoop configuration.
But notice that the property 'importttsv.separator', defined by org.apache.hadoop.hbase.mapreduce.ImportTsv.SEPARATOR_CONF_KEY, is Base64-encoded in org.apache.hadoop.hbase.mapreduce.ImportTsv:245:
public static Job createSubmittableJob(Configuration conf, String[] args)
    throws IOException, ClassNotFoundException {
  // Support non-XML supported characters
  // by re-encoding the passed separator as a Base64 string.
  String actualSeparator = conf.get(SEPARATOR_CONF_KEY);
  if (actualSeparator != null) {
    conf.set(SEPARATOR_CONF_KEY,
        Base64.encodeBytes(actualSeparator.getBytes()));
  }
  ...
}
This is decoded back in org.apache.hadoop.hbase.mapreduce.ImportTsv:92:
protected void doSetup(Context context) {
  Configuration conf = context.getConfiguration();

  // If a custom separator has been used,
  // decode it back from Base64 encoding.
  separator = conf.get(ImportTsv.SEPARATOR_CONF_KEY);
  if (separator == null) {
    separator = ImportTsv.DEFAULT_SEPARATOR;
  } else {
    separator = new String(Base64.decode(separator));
  }
  ...
}
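To see the round-trip concretely, the Base64 form of the single \001 byte can be checked from the shell (an illustration, not part of ImportTsv):

$ echo -ne '\001' | base64
AQ==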
Finally, the separator is checked to be a single byte in org.apache.hadoop.hbase.mapreduce.ImportTsv:97:
public TsvParser(String columnsSpecification, String separatorStr) {
  // Configure separator
  byte[] separator = Bytes.toBytes(separatorStr);
  Preconditions.checkArgument(separator.length == 1,
      "TsvParser only supports single-byte separators");
  separatorByte = separator[0];
  ...
}
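This check is exactly what rejects a separator typed literally on the command line: without shell interpretation, \001 arrives as four characters rather than one byte, as a quick check shows:

$ echo -n '\001' | wc -c     # typed literally: backslash, 0, 0, 1
4
$ echo -ne '\001' | wc -c    # interpreted by echo -e: the single \001 byte
1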
As a solution, I suggest redeclaring a main method that injects the separator before ImportTsv executes.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.ImportTsv;

public class ImportTsvByteSeparator extends ImportTsv
{
  /**
   * Main entry point.
   *
   * @param args The command line parameters.
   * @throws Exception When running the job fails.
   */
  public static void main(String[] args) throws Exception {
    // Read the separator byte from the configuration
    // (defaults to 1, i.e. \001, Hive's field delimiter).
    Configuration conf = HBaseConfiguration.create();
    int byteSeparator = conf.getInt("importtsv.byte_separator", 1);
    String separator = Character.toString((char) byteSeparator);

    // ImportTsv.main() creates its own Configuration, so setting the
    // property on a local conf here would be silently ignored. Pass the
    // separator through as a -D argument instead; ImportTsv's
    // GenericOptionsParser moves it into the job configuration, and
    // createSubmittableJob() then Base64-encodes it as shown above.
    String[] newArgs = new String[args.length + 1];
    newArgs[0] = "-Dimporttsv.separator=" + separator;
    System.arraycopy(args, 0, newArgs, 1, args.length);

    // Now call ImportTsv's main method with the augmented arguments
    ImportTsv.main(newArgs);
  }
}
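Compiled and placed on the classpath, it can then be launched like the stock tool (a hypothetical invocation; the jar name importtsv-byte-separator.jar is made up for the example):

HBASE_CLASSPATH=importtsv-byte-separator.jar hbase ImportTsvByteSeparator \
    -Dimporttsv.columns=HBASE_ROW_KEY,f:pageviews,f:visit \
    -Dimporttsv.bulk.output=output modelvar /000000.gz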
I don't think we can override a method deeper inside the process (such as createSubmittableJob()) because of the visibility of the attributes.
Upvotes: 1