Reputation: 167
I am new to big data technologies and have a question about how HBase is integrated with Hadoop. What does it mean that "HBase sits on top of HDFS"? My understanding is that HDFS is a collection of structured and unstructured data distributed across multiple nodes, and that HBase holds structured data.
How is HBase integrated with Hadoop to provide real-time access to the underlying data? Do we have to write special jobs to build indexes and such? In other words, is there an additional layer between HBase and HDFS that holds the data in a structure HBase understands?
Upvotes: 1
Views: 1090
Reputation: 435
It's easy to understand:
HDFS is a distributed filesystem that provides reads and writes through an append model.
HBase is a NoSQL database that is built on top of the HDFS filesystem and depends on it.
You can read about this in the Apache HBase documentation.
Upvotes: 1
Reputation: 1494
HDFS is a distributed filesystem. You can do most regular filesystem operations on it, such as listing the files in a directory, writing a regular file, reading part of a file, etc. It's not simply "a collection of structured or unstructured data" any more than your EXT4 or NTFS filesystems are.
HBase is an in-memory key-value store that may persist to HDFS (this isn't a hard requirement; you can run HBase on any distributed filesystem). For any key read request, HBase first checks its in-memory caches for the value, and otherwise goes to its stored files on HDFS to seek to and read out the specific value. HBase offers various configuration options to control how the cache is used, but its speed comes from a combination of caching and indexed persistence (faster file reads that seek straight to the data).
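To make the "check memory first, then go to the files" read path concrete, here is a minimal toy sketch in Python. It is not HBase code: the class `ToyReadPath`, the `stored` dict standing in for files on HDFS, and the `cache_size` parameter are all made up for illustration, and the real caching behaviour is considerably more involved.

```python
from collections import OrderedDict


class ToyReadPath:
    """Toy model of a cache-first read. Hypothetical, not HBase's real classes."""

    def __init__(self, stored, cache_size=2):
        self.stored = stored          # stands in for files persisted on HDFS
        self.cache = OrderedDict()    # small LRU cache held in memory
        self.cache_size = cache_size
        self.file_reads = 0           # counts how often we fall through to storage

    def get(self, key):
        # 1. Serve from memory if the value is cached.
        if key in self.cache:
            self.cache.move_to_end(key)
            return self.cache[key]
        # 2. Cache miss: read from the persisted data instead.
        self.file_reads += 1
        value = self.stored.get(key)
        if value is not None:
            self.cache[key] = value
            if len(self.cache) > self.cache_size:
                self.cache.popitem(last=False)  # evict the least recently used entry
        return value


store = ToyReadPath({"row1": "a", "row2": "b", "row3": "c"})
first = store.get("row1")    # miss: goes to storage
second = store.get("row1")   # hit: served from the cache
```

In this sketch the second `get` never touches `stored`, which is the effect the cache has on repeated reads of the same key.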
HBase's file-based persistence on HDFS indexes the keys automatically when it writes, so its users do not need to do any manual indexing. These files are regular HDFS files, but in a specialised format for HBase's use, known as HFiles.
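As a rough illustration of "indexing happens at write time", the sketch below sorts the in-memory entries, splits them into blocks, and records the first key of each block as an index. A lookup then uses the index to pick a single block instead of scanning everything. This only mimics the idea; the function names `flush` and `lookup` and the `block_size` parameter are hypothetical, and the real HFile format is binary and far more detailed.

```python
import bisect


def flush(memstore, block_size=2):
    """Sort in-memory entries and lay them out as blocks plus a block index."""
    items = sorted(memstore.items())
    blocks = [items[i:i + block_size] for i in range(0, len(items), block_size)]
    index = [block[0][0] for block in blocks]  # first key of each block
    return blocks, index


def lookup(blocks, index, key):
    """Use the index to pick one block, then scan only that block."""
    pos = bisect.bisect_right(index, key) - 1
    if pos < 0:
        return None  # key sorts before the first block
    for k, v in blocks[pos]:
        if k == key:
            return v
    return None


# Entries arrive in arbitrary order; the index is built as a side effect of the write.
blocks, index = flush({"r3": "c", "r1": "a", "r4": "d", "r2": "b", "r5": "e"})
value = lookup(blocks, index, "r4")
```

The caller never builds the index explicitly; it falls out of writing the data in sorted order, which is why no separate indexing job is needed.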
These articles are slightly dated, but are still very reflective of the architecture HBase uses: http://blog.cloudera.com/blog/2012/06/hbase-write-path/ and http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/, and should help if you want to dig deeper.
Upvotes: 2
Reputation: 682
HDFS is a distributed file system, and HBase is a NoSQL database that depends on the HDFS filesystem to store its data.
You should read up on these technologies, since your structured/unstructured comparison is not correct.
Update
You should check out the Google File System, MapReduce, and Bigtable papers if you are interested in the origins of these technologies.
Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "The Google file system." ACM SIGOPS operating systems review. Vol. 37. No. 5. ACM, 2003.
Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113.
Chang, Fay, et al. "Bigtable: A distributed storage system for structured data." ACM Transactions on Computer Systems (TOCS) 26.2 (2008): 4.
Upvotes: 1