Reputation: 187
Hi, I am quite new to Hadoop and I am trying to import a CSV table into HBase using MapReduce.
I am using Hadoop 1.2.1 and HBase 1.1.1.
I have data in the following format:
Wban Number, YearMonthDay, Time, Hourly Precip
03011,20060301,0050,0
03011,20060301,0150,0
I have written the following code for the bulk load:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class BulkLoadDriver extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int result = ToolRunner.run(HBaseConfiguration.create(), new BulkLoadDriver(), args);
    }

    public static enum COUNTER_TEST {FILE_FOUND, FILE_NOT_FOUND};

    public String tableName = "hpd_table"; // name of the table to be inserted in HBase

    @Override
    public int run(String[] args) throws Exception {
        //Configuration conf = this.getConf();
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "BulkLoad");
        job.setJarByClass(getClass());
        job.setMapperClass(bulkMapper.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        job.setInputFormatClass(TextInputFormat.class);
        TableMapReduceUtil.initTableReducerJob(tableName, null, job); // for the HBase table
        job.setNumReduceTasks(0);
        return (job.waitForCompletion(true) ? 0 : 1);
    }

    private static class bulkMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        //static class bulkMapper extends TableMapper<ImmutableBytesWritable, Put> {
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] val = value.toString().split(",");
            // store the split values in byte format so that they can be added to the Put object
            byte[] wban = Bytes.toBytes(val[0]);
            byte[] ymd = Bytes.toBytes(val[1]);
            byte[] tym = Bytes.toBytes(val[2]);
            byte[] hPrec = Bytes.toBytes(val[3]);
            Put put = new Put(wban);
            put.add(ymd, tym, hPrec);
            System.out.println(wban);
            context.write(new ImmutableBytesWritable(wban), put);
            context.getCounter(COUNTER_TEST.FILE_FOUND).increment(1);
        }
    }
}
I have created a jar for this and ran the following in the terminal:
hadoop jar ~/hadoop-1.2.1/MRData/bulkLoad.jar bulkLoad.BulkLoadDriver /MR/input/200603hpd.txt hpd_table
But the output that I get is hundreds of lines of the following type:
attempt_201509012322_0001_m_000000_0: [B@2d22bfc8
attempt_201509012322_0001_m_000000_0: [B@445cfa9e
I am not sure what they mean or how to perform this bulk upload. Please help.
Thanks in advance.
Upvotes: 0
Views: 3399
Reputation: 1253
There are several ways to import data into HBase. Please have a look at the following link:
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/admin_hbase_import.html
HBase BulkLoad:
Your data file is already in CSV format, so two steps remain:
Process your data into HFile format. See http://hbase.apache.org/book/hfile_format.html for details about the HFile format. Usually you use a MapReduce job for the conversion, and you often need to write the Mapper yourself because your data is unique. The job must emit the row key as the Key, and either a KeyValue, a Put, or a Delete as the Value. The Reducer is handled by HBase; configure it using HFileOutputFormat.configureIncrementalLoad(), and it does the following (a driver sketch is shown after this step):
One HFile is created per region in the output folder. Input data is almost completely re-written, so you need available disk space at least twice the size of the original data set. For example, for a 100 GB output from mysqldump, you should have at least 200 GB of available disk space in HDFS. You can delete the original input file at the end of the process.
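To produce those HFiles, the driver needs a few extra settings compared to your current TableOutputFormat-style job. Below is a minimal sketch of the body of run(), not a drop-in replacement for your code: it assumes the table hpd_table already exists, that your mapper emits <ImmutableBytesWritable, Put>, and it uses the HBase 1.x client classes Connection, ConnectionFactory, TableName and HFileOutputFormat2 plus org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.

Configuration conf = HBaseConfiguration.create();
Job job = new Job(conf, "BulkLoad");
job.setJarByClass(BulkLoadDriver.class);
job.setMapperClass(bulkMapper.class);
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(Put.class);
job.setInputFormatClass(TextInputFormat.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
// HFiles are written to this directory, not straight into the table
FileOutputFormat.setOutputPath(job, new Path(args[1]));

Connection connection = ConnectionFactory.createConnection(conf);
try {
    TableName table = TableName.valueOf("hpd_table");
    // sets the reducer, partitioner and output format needed to produce one HFile per region
    HFileOutputFormat2.configureIncrementalLoad(job,
            connection.getTable(table),
            connection.getRegionLocator(table));
    return job.waitForCompletion(true) ? 0 : 1;
} finally {
    connection.close();
}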
Load the files into HBase. Use the LoadIncrementalHFiles command (more commonly known as the completebulkload tool), passing it a URL that locates the files in HDFS. Each file is loaded into the relevant region on the RegionServer for the region. You can limit the number of versions that are loaded by passing the --versions=N option, where N is the maximum number of versions to include, from newest to oldest (largest timestamp to smallest timestamp). If a region was split after the files were created, the tool automatically splits the HFile according to the new boundaries. This process is inefficient, so if your table is being written to by other processes, you should load as soon as the transform step is done.
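For example, if the job above wrote its HFiles to /MR/output (an illustrative path, substitute your own), the load could be run from the terminal as:

hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /MR/output hpd_table

LoadIncrementalHFiles is the class behind the completebulkload tool mentioned above.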
Upvotes: 2