Reputation: 27
We are using the HBase bulk loading technique explained in http://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/ (that is, creating HFiles directly with HFileOutputFormat).
We have to go with this option to pre-populate the HBase cluster with all the data we already have in our legacy system(s).
As HBase does not support secondary tables (or indexes), we maintain secondary tables (or indexes) at application level.
Now the question is: how do we use the bulk load technique to create HFiles for different tables (the main table and the secondary tables/indexes)? Is there any multiple-table HFileOutputFormat (something like HFileMultiOutputFormat)?
I understand that we could create multiple MR jobs and run each job separately, but the cost comes from reading that much data (more than a few TB) once per job. I want to find a way to read once and write multiple times. Chaining MR jobs does not help: every map task needs the same input data, and chaining restricts the second map task to the output of the first.
Similar questions have been asked here and here, but they remain unanswered, so I am trying again.
Upvotes: 1
Views: 659
Reputation: 1810
First of all, this is a very valid requirement.
The first step is to read through and understand the code for HFileOutputFormat.
The portion you are interested in is the directory structure it creates from the column families. You will want to create a directory structure of the form table --> column family --> HFile.
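To make that layout concrete, here is a minimal sketch of a path builder in plain Java. It is not taken from HFileOutputFormat; the class name `HFileLayout`, the output root, and the table, column family, and file names are all hypothetical and only illustrate the table --> column family --> HFile nesting.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of HBase): builds output paths in a
// <outputRoot>/<table>/<columnFamily>/<hfileName> layout.
final class HFileLayout {
    private HFileLayout() {}

    static Path hfilePath(String outputRoot, String table,
                          String columnFamily, String hfileName) {
        // One directory per table, one sub-directory per column family.
        return Paths.get(outputRoot, table, columnFamily, hfileName);
    }

    public static void main(String[] args) {
        // Assumed names: a main table "users" and an index table "users_by_email",
        // each with a single column family "d".
        System.out.println(hfilePath("bulkload-out", "users", "d", "hfile-0001"));
        System.out.println(hfilePath("bulkload-out", "users_by_email", "d", "hfile-0001"));
    }
}
```

With a layout like this, each table's sub-directory could then be bulk loaded into its own table separately.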
You can then use multiple outputs to write the data for the different tables from a single pass over the input.
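The read-once, write-many idea can be sketched without any Hadoop dependency: each input record is read once and expanded into one cell per destination table, tagged with the table name. This is only an illustration under assumptions: the class `BulkLoadFanOut`, the tables `users` and `users_by_email`, and the record fields are hypothetical. In a real job the tagged cells would still have to be routed to per-table writers (for instance through Hadoop's `MultipleOutputs` or a custom output format) and written in sorted row-key order, which HFiles require and which this sketch does not address.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: fan one input record out to a main table and an index table.
final class BulkLoadFanOut {
    private BulkLoadFanOut() {}

    /** One output cell tagged with the table it belongs to. */
    static final class TaggedCell {
        final String table;
        final String rowKey;
        final String value;

        TaggedCell(String table, String rowKey, String value) {
            this.table = table;
            this.rowKey = rowKey;
            this.value = value;
        }

        @Override
        public String toString() {
            return table + ":" + rowKey + "=" + value;
        }
    }

    // Assumed schema: "users" is keyed by user id and stores the email;
    // "users_by_email" is keyed by email and points back to the user id.
    static List<TaggedCell> fanOut(String userId, String email) {
        List<TaggedCell> cells = new ArrayList<>();
        cells.add(new TaggedCell("users", userId, email));
        cells.add(new TaggedCell("users_by_email", email, userId));
        return cells;
    }

    public static void main(String[] args) {
        // The record is read once; both tables' cells come from that single read.
        for (TaggedCell cell : fanOut("u42", "a@example.com")) {
            System.out.println(cell);
        }
    }
}
```

The table tag is what a downstream writer would use to pick the right table directory, so the expensive input scan happens only once.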
Upvotes: 0