Reputation: 1869
How can I configure HBase to store only the first version of each cell? Suppose the following HTable:
row_key cf1:c1 timestamp
----------------------------------------
1 x t1
After putting ("1","cf1:c1",t2), with ColumnDescriptor.DEFAULT_VERSIONS = 2, the table becomes:
row_key cf1:c1 timestamp
----------------------------------------
1 x t1
1 x t2
where t2 > t1.
My question is how to change this so that the first version of a cell is the only version that can be stored and retrieved. In the example above, the only version would be the 't1' one. In other words, I want HBase to ignore insertions on duplicates.
I know that setting VERSIONS to 1 for the HTable and putting with a timestamp of Long.MAX_VALUE - System.currentTimeMillis() would solve my problem, but I don't know whether it is the best solution. Are there any concerns with changing the timestamp to Long.MAX_VALUE - System.currentTimeMillis()? Does it have any performance issues?
Upvotes: 3
Views: 741
Reputation: 19987
There are two strategies that I can think of:
1. Setting VERSIONS to 1 for the HTable and putting with a timestamp of Long.MAX_VALUE - System.currentTimeMillis() will generally work and does not have any major performance issues.
On write: which write counts as the first one is determined by the client's System.currentTimeMillis(). Note that this might not actually be the machine that tried to write to the cell first, since the clocks of HBase clients might be out of sync.
On read:
2. To get true ordering through atomicity, meaning only the first write to reach the region server will succeed, you can use the checkAndPut operation.
From the docs:
public boolean checkAndPut(byte[] row, byte[] family, byte[] qualifier, byte[] value, Put put) throws IOException
Atomically checks if a row/family/qualifier value matches the expected value. If it does, it adds the put. If the passed value is null, the check is for the lack of column (ie: non-existance)
So by setting value to null, your Put will only succeed if the cell did not exist. If your Put succeeded, the return value will be true. This gives true atomicity, but at a write performance cost.
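To make the "first write wins" semantics concrete, here is a minimal in-memory sketch. It is not HBase client code: the class FirstWriteWinsDemo and the method putIfCellAbsent are hypothetical names, and a ConcurrentHashMap stands in for the table. It only models the behaviour described above, where a write succeeds and returns true solely when the cell does not exist yet.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory model; a real client would call
// checkAndPut(row, family, qualifier, null, put) on the table instead.
public class FirstWriteWinsDemo {
    // Key is an illustrative "row/family:qualifier" string, value is the cell value.
    private final ConcurrentHashMap<String, String> cells = new ConcurrentHashMap<>();

    // Atomically stores the value only if the cell is absent.
    // Returns true if this call performed the write, false if a value already existed.
    boolean putIfCellAbsent(String cellKey, String value) {
        return cells.putIfAbsent(cellKey, value) == null;
    }

    // Returns the stored value, or null if the cell was never written.
    String get(String cellKey) {
        return cells.get(cellKey);
    }

    public static void main(String[] args) {
        FirstWriteWinsDemo table = new FirstWriteWinsDemo();
        System.out.println(table.putIfCellAbsent("1/cf1:c1", "x")); // first write
        System.out.println(table.putIfCellAbsent("1/cf1:c1", "y")); // duplicate write
        System.out.println(table.get("1/cf1:c1"));
    }
}
```

The boolean result plays the role of the checkAndPut return value: the caller learns immediately whether its write was the one that was kept.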
On write: a Get is issued internally to check for existence. Once non-existence is confirmed, the Put is issued. As you can imagine, this has a pretty big performance impact for each write, since each write now also involves a read and a lock. You also cannot batch checkAndPut operations by using checkAndMutate, since each Put needs its own check. This means each Put needs to be a separate request, so you will also pay a latency cost when writing in batches.
On read:
If true ordering really matters, or you need to read each row before or after you write to HBase anyway (for example, to find out whether your write succeeded), you're better off with strategy 2. In all other cases I'd recommend strategy 1, since its write performance is much better. In that case, just make sure your clients are properly time-synced.
Upvotes: 3
Reputation: 4529
You can insert the Put with Long.MAX_VALUE - timestamp and configure the table to store only 1 version (max versions => 1). This way only the first (earliest) Put will be returned by the Scan, because all successive Puts will have a smaller timestamp value.
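The arithmetic behind this can be checked with a small sketch. It is an in-memory model rather than HBase code: ReversedTimestampDemo, reversed and SingleVersionCell are hypothetical names, and the cell simply keeps the version with the largest timestamp, which is how a single-version read is assumed to behave here.

```java
public class ReversedTimestampDemo {
    // Maps a wall-clock time to the timestamp stored with the cell.
    // Earlier wall-clock times yield larger stored timestamps.
    static long reversed(long wallClockMillis) {
        return Long.MAX_VALUE - wallClockMillis;
    }

    // Hypothetical stand-in for one cell with max versions = 1:
    // it retains only the version with the largest timestamp.
    static final class SingleVersionCell {
        private long timestamp;
        private String value; // null until the first write

        void put(long ts, String v) {
            // In this model a tie goes to the later write.
            if (value == null || ts >= timestamp) {
                timestamp = ts;
                value = v;
            }
        }

        String get() {
            return value;
        }
    }

    public static void main(String[] args) {
        long t1 = 1_000L; // earlier write
        long t2 = 2_000L; // later write
        SingleVersionCell cell = new SingleVersionCell();
        cell.put(reversed(t1), "first");
        cell.put(reversed(t2), "second"); // smaller stored timestamp, so it is not kept
        System.out.println(cell.get());   // prints "first"
    }
}
```

Because the stored timestamp comes from each writer's own clock, the result is only as reliable as the clock synchronisation between clients.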
Upvotes: 0