Mihir

Reputation: 603

HBase shell - Retrieve (only) column values (and not column name)

I am pretty new to Hadoop and HBase and am trying to learn whether it fits my use case. Being new to Java as well (I am basically a Perl/Unix and DB developer), I would like a solution in the HBase shell if possible.

I have an HBase table (schema below) in which I am trying to keep historical data that can be used for audit and analytics.

Assume the basic structure is as below:

    rowkey 'cf1:id', 'cf1:price', 'cf1:user', 'cf1:timestamp'

Here, rowkey is the instrument (or any object) id; id identifies which set of columns holds the latest data (the first entry has the value 1, the next 2, and so on); and user is the user who updated the data.

For example, the data initially looks like this:

    hbase(main):009:0> scan 'price_history'
    ROW         COLUMN+CELL
    row1        column=cf1:id, timestamp=1389020633920, value=1
    row1        column=cf1:pr, timestamp=1389020654614, value=109.45
    row1        column=cf1:us, timestamp=1389020668338, value=feed
    row2        column=cf1:id, timestamp=1389020687334, value=1
    row2        column=cf1:pr, timestamp=1389020697880, value=1345.65
    row2        column=cf1:us, timestamp=1389020708403, value=feed

Now assume row2 (instrument 2) is updated on the same day with a new price:

    hbase(main):003:0> scan 'price_history'
    ROW         COLUMN+CELL
    row1        column=cf1:id, timestamp=1389020633920, value=1
    row1        column=cf1:pr, timestamp=1389020654614, value=109.45
    row1        column=cf1:us, timestamp=1389020668338, value=feed
    row2        column=cf1:id, timestamp=1389020859674, value=2
    row2        column=cf1:pr, timestamp=1389020697880, value=1345.65
    row2        column=cf1:pr1, timestamp=1389020869856, value=200
    row2        column=cf1:us, timestamp=1389020708403, value=feed
    row2        column=cf1:us1, timestamp=1389020881601, value=user1

Note that id has changed to 2 to indicate that the second set of data is the latest, and new columns (pr1, us1) have been added for the new values.

What I want is:

    1) Can I fetch only the value of the id column? That is, the output should be just 1 or 2, without any of the other attributes.
    2) Based on that output I will fetch the remaining data. Can I also search by rowkey value, i.e. get the output for the row whose key is row1 (I may have a list: row1, row2, ..., rowN)?

Please keep the solution in the HBase shell as much as possible (other solutions are also welcome).

Also, if any architect can suggest a better way to model the table to keep track of price changes/versions, that is welcome too.

Thanks.

Upvotes: 1

Views: 3845

Answers (1)

mwebster

Reputation: 111

That is going to be tough to do in the shell without a lot of piping and grepping of the output. The shell's output formatting also makes this difficult because of how it breaks up lines. A lighter-weight solution than writing Java is to write your scanner in Ruby: HBase ships with the JRuby jar and lets you execute Ruby scripts.

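    include Java
    import "org.apache.hadoop.hbase.HBaseConfiguration"
    import "org.apache.hadoop.hbase.client.HTable"
    import "org.apache.hadoop.hbase.client.Scan"
    import "org.apache.hadoop.hbase.util.Bytes"

    config = HBaseConfiguration.create()
    family = Bytes.toBytes("family-name")
    qualifier = Bytes.toBytes("qualifier")

    # Restrict the scan to the single column of interest
    scan = Scan.new()
    scan.addColumn(family, qualifier)

    table = HTable.new(config, "table-name")
    scanner = table.getScanner(scan)
    scanner.each do |result|
      keyval = result.getColumnLatest(family, qualifier)
      # Values put from the shell are stored as strings, so decode them as strings;
      # use Bytes.toDouble only if the cell was written as an 8-byte double.
      puts Bytes.toString(keyval.getValue())
    end

    # Release the scanner and table resources
    scanner.close()
    table.close()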

That should get you pretty close. You can add more data to the output, for example the row key. To run it, use hbase org.jruby.Main your_ruby_file.rb
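
For comparison, the "piping and grepping" route mentioned above could look roughly like the plain-Ruby sketch below, which pulls row key and value pairs for one column out of captured scan output. The names CELL_LINE and extract_values are hypothetical, and the sketch assumes each cell is printed on a single line in the format shown in the question; wrapped lines would not match, which is the fragility noted above.

```ruby
# Matches one cell line of `scan` output, for example:
#   row2   column=cf1:id, timestamp=1389020859674, value=2
# Assumption: the whole cell is on one line (the shell may wrap long lines).
CELL_LINE = /\A\s*(\S+)\s+column=([^,]+),\s*timestamp=(\d+),\s*value=(.*?)\s*\z/

# Returns an array of [row_key, value] pairs for the requested column.
def extract_values(scan_output, column)
  pairs = []
  scan_output.each_line do |line|
    m = CELL_LINE.match(line)
    pairs << [m[1], m[4]] if m && m[2] == column
  end
  pairs
end

# Sample text copied from the scan output in the question.
sample = <<~OUT
  ROW    COLUMN+CELL
  row1   column=cf1:id, timestamp=1389020633920, value=1
  row1   column=cf1:pr, timestamp=1389020654614, value=109.45
  row2   column=cf1:id, timestamp=1389020859674, value=2
  row2   column=cf1:pr1, timestamp=1389020869856, value=200
OUT

# Print "rowkey<TAB>value" for the id column only.
extract_values(sample, "cf1:id").each do |row, value|
  puts "#{row}\t#{value}"
end
```

To use it on real data, the sample text could be replaced with the captured output of a shell scan limited to one column, such as scan 'price_history', {COLUMNS => 'cf1:id'}. The JRuby scanner above remains the more robust option, since it does not depend on how the shell formats its output.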

Upvotes: 2
