Reputation: 646
We are working on a project that uses HBase as an operational data store; all data arrives in HBase in real time. Every 2 hours, the data in HBase needs to be synced to Hive so that analytical queries can run on top of the latest data.
For syncing data from HBase to Hive:
For insert/update-only scenarios, I can use the timestamp that HBase stores with each cell to find the inserted/updated records. For "DELETE" scenarios, I am struggling to find the right approach.
Does the HBase Scan API provide any option to do that?
Or should I go with an SQL option like Apache Phoenix for this?
Upvotes: 1
Views: 398
Reputation: 9425
Here is the answer from the HBase Reference Guide, section "Keep Deleted Cells":
A new "raw" scan options returns all deleted rows and the delete markers...
. . .[example]
hbase(main):017:0> scan 'test', {RAW=>true, VERSIONS=>1000}
ROW COLUMN+CELL
r1 column=e:c1, timestamp=14, value=value
r1 column=e:c1, timestamp=12, value=value
r1 column=e:c1, timestamp=11, type=DeleteColumn
r1 column=e:c1, timestamp=10, value=value
1 row(s) in 0.0120 seconds
. . .
Note that there can be different types of markers -- DeleteColumn or DeleteFamily -- depending on what kind of DELETE has occurred.
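To tie this back to the 2-hour sync, here is a minimal Python sketch that picks the delete markers out of raw scan output and keeps only those at or after the last sync timestamp. It assumes the cell lines look like the shell excerpt above (`row column=..., timestamp=..., type=...`); the function name `delete_markers_since` and the `last_sync_ts` parameter are hypothetical names used for illustration, not part of any HBase API.

```python
import re

# Matches one cell line of raw scan output in the format shown in the excerpt.
# Assumption: numeric timestamps and either "type=..." or "value=..." at the end.
CELL_RE = re.compile(
    r"^\s*(?P<row>\S+)\s+column=(?P<column>[^,]+),\s*timestamp=(?P<ts>\d+),\s*"
    r"(?:type=(?P<type>\w+)|value=(?P<value>.*))$"
)


def delete_markers_since(raw_scan_lines, last_sync_ts):
    """Return (row, column, timestamp, marker_type) tuples for every
    delete marker whose timestamp is >= last_sync_ts."""
    markers = []
    for line in raw_scan_lines:
        match = CELL_RE.match(line)
        if match is None:
            continue  # header, footer, or anything that is not a cell line
        marker_type = match.group("type")
        # Marker types such as DeleteColumn and DeleteFamily start with "Delete".
        if marker_type and marker_type.startswith("Delete"):
            ts = int(match.group("ts"))
            if ts >= last_sync_ts:
                markers.append(
                    (match.group("row"), match.group("column"), ts, marker_type)
                )
    return markers


sample = [
    "ROW COLUMN+CELL",
    "r1 column=e:c1, timestamp=14, value=value",
    "r1 column=e:c1, timestamp=12, value=value",
    "r1 column=e:c1, timestamp=11, type=DeleteColumn",
    "r1 column=e:c1, timestamp=10, value=value",
    "1 row(s) in 0.0120 seconds",
]

for marker in delete_markers_since(sample, last_sync_ts=11):
    print(marker)
```

In a real job you would probably not parse shell output; the Java client exposes the same behaviour through `Scan.setRaw(true)`, and `Scan.setTimeRange(min, max)` can restrict the scan to the sync window. The parsing above is only meant to show the filtering logic on the data from the excerpt.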
Upvotes: 1