Reputation: 19021
Given the write-once constraint of HDFS, which underlies HBase, it seems inappropriate to me to use HBase as a database for frequently changed, per-user setting values of tens of millions of users. Examples of such settings are boolean values that control the visibility of a user's personal information (such as birthday, phone number, and email address), and per-friend flags that control who is allowed to access the visible pieces of that information. I'm worried that storage size will keep growing every time users change their setting values, even if HBase merges multiple changes into one write to HDFS.
However, I'm not sure whether it is really inappropriate; my understanding may be wrong. Could you comment on this, please?
Upvotes: 1
Views: 208
Reputation: 9457
To expand a bit on Jacob's answer, understanding why HBase is good for oft-changed values involves understanding the approach of Log-Structured Merge (LSM) trees.
Unlike typical relational databases (which use B+ trees and "update in place" semantics), all writes to HBase are treated as timestamped appends. For every PUT you do, regardless of whether it's for a new key ("INSERT", in RDBMS language) or an existing key ("UPDATE", in RDBMS land), two things happen: the write is appended to a write-ahead log on disk (a fast, sequential append), and the new value is inserted, along with its timestamp, into a sorted in-memory structure, next to any older versions of the same key.
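As a rough sketch of that write path (toy Python with hypothetical class and key names, not the real HBase code):

```python
# Toy model of an LSM-style write path: every PUT is appended to a log
# and buffered, timestamped, in a sorted in-memory structure.
class ToyMemStore:
    def __init__(self):
        self.wal = []   # write-ahead log: sequential appends only
        self.mem = {}   # key -> list of (timestamp, value) versions

    def put(self, key, value, ts):
        self.wal.append((key, value, ts))                 # 1. durable sequential append
        self.mem.setdefault(key, []).append((ts, value))  # 2. buffered in memory, by key

    def get(self, key):
        # A read sees the newest timestamped version, even for "updates".
        versions = self.mem.get(key, [])
        return max(versions)[1] if versions else None

store = ToyMemStore()
store.put("user42:birthday_visible", "true", ts=1)
store.put("user42:birthday_visible", "false", ts=2)  # an "UPDATE" is just another append
print(store.get("user42:birthday_visible"))  # -> false
print(len(store.wal))                        # -> 2 (both writes are kept, for now)
```

Note that the second PUT does not overwrite anything in place; reclaiming the older version is deferred to flushes and compactions, described next.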
The next time there's enough new stuff in memory to warrant it, the stuff in memory gets flushed out to disk (which, again, is pretty fast since it's already sorted). And, depending on the settings you used on the table (e.g. whether you want to keep lots of past versions around, whether you want to keep values that were deleted, etc.), older versions of the values might get cleaned out immediately at flush time as well.
In either case, though, it's obvious that over time, different versions of a single value might be lodged in more than one of these store files, and a single read is going to have to hit many store files. That's where compactions come in: to combine many store files into one, so that reads don't have to do that.
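A toy sketch of what a compaction does (hypothetical Python, not the real HBase implementation): merge the sorted store files and keep only the newest version(s) of each key, which is exactly why storage doesn't grow without bound as users keep changing their settings.

```python
# Toy compaction: merge several store files (lists of (key, ts, value))
# into one, keeping at most `max_versions` newest versions per key.
def compact(store_files, max_versions=1):
    merged = {}  # key -> list of (timestamp, value)
    for store_file in store_files:
        for key, ts, value in store_file:
            merged.setdefault(key, []).append((ts, value))
    result = []
    for key in sorted(merged):                       # output stays sorted by key
        newest_first = sorted(merged[key], reverse=True)
        for ts, value in newest_first[:max_versions]:
            result.append((key, ts, value))
    return result

# Three flushes left three versions of the same setting in three files:
f1 = [("user42:email_visible", 1, "true")]
f2 = [("user42:email_visible", 2, "false")]
f3 = [("user42:email_visible", 3, "true"), ("user42:phone_visible", 3, "false")]
print(compact([f1, f2, f3]))
# -> [('user42:email_visible', 3, 'true'), ('user42:phone_visible', 3, 'false')]
```

After the merge, a read for `user42:email_visible` hits one file instead of three, and the two stale versions are gone.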
Upvotes: 2
Reputation: 6671
HDFS, which HBase uses as its file system, is an append-only file system, meaning no part of a file is ever overwritten. New changes are stacked on top of old ones, much like in CouchDB.
However, unlike CouchDB, HBase manages its own splitting and compaction.
It is important to stress that major compactions are absolutely necessary for StoreFile cleanup; the only variable is when they occur. They can be administered through the HBase shell, or via HBaseAdmin.
During a major compaction, your old data is dropped and the space is freed up.
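For instance, you can trigger one manually (table name here is hypothetical) from the HBase shell:

```
major_compact 'user_settings'
```

or, if my memory of the client API serves, programmatically with something like `admin.majorCompact("user_settings")` on an `HBaseAdmin` instance.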
You should probably separate your frequently changed data into its own column family, and perhaps turn compression on. Unfortunately, at this time flushing is done globally and not per column family; however, HBASE-3149 is addressing that.
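For example, a table with the volatile flags isolated in their own compressed column family might be created like this in the HBase shell (table, family names, and version count are illustrative, not a recommendation):

```
create 'user_settings', {NAME => 'profile'}, {NAME => 'flags', VERSIONS => 1, COMPRESSION => 'GZ'}
```

With `VERSIONS => 1`, older versions of a flag become eligible for removal as flushes and compactions run, which directly limits the storage growth you're worried about.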
To directly answer your question: yes, HBase can store frequently modified data. Just make sure someone carefully reads the configuration page and makes good decisions for your situation.
Upvotes: 4