Reputation: 31
I have gone through almost all the resources I could find on Google, but there is still something I don't get about Hadoop and NoSQL.
Let's imagine that I have a lot of data to store. I use Hadoop and its native distributed file system (HDFS), BUT I also want real-time information, so I need NoSQL. Where is my database going to be installed? On the datanodes themselves? On the namenode? Both?
Let's imagine (once again) that data is coming into my system: the namenode is going to split it into several pieces and replicate them across different datanodes. With NoSQL, will it work the same way? Does HDFS take part in this process?
The metadata stored in the NameNode gives the addresses, so how is it going to be queried?
I think I basically understand the concepts of Hadoop and HBase, but I get confused when I go further.
Upvotes: 2
Views: 3187
Reputation: 38950
Regarding your queries:
1) Data will be stored on the datanodes (which run the RegionServers). HBase needs to be installed on the data nodes.
2) HBase is not required on the namenode.
Have a look at the questions Role of datanode, regionserver in Hbase-hadoop integration and Should the HBase region server and Hadoop data node on the same machine? too.
HOW DOES HBASE WORK?
Since you know Hadoop & HBase individually, I am not explaining them in detail. I will explain how HBase works with the HDFS/Hadoop ecosystem (formatted and edited for readability from an Edureka article).
HBase is the Hadoop storage manager that provides low-latency random reads and writes on top of HDFS. It can handle petabytes of data.
HBase uses an auto-sharding feature, which means that large tables are dynamically distributed by the system.
The basic unit of horizontal scalability in HBase is called a Region. A Region is a subset of the table's data, essentially a contiguous, sorted range of rows.
Initially, there is only one region for a table. When a region becomes too large after more rows are added, it is split into two at the middle key, creating two roughly equal halves.
In HBase the slaves are called Region Servers. Each Region Server is responsible for serving a set of regions, and one Region (i.e. a range of rows) can be served by only one Region Server.
The HBase architecture has two main services: the HMaster, which is responsible for coordinating the cluster and executing administrative operations, and the HRegionServer, which is responsible for handling a subset of the table's data.
HBase Master coordinates the HBase Cluster and is responsible for administrative operations.
A Region Server can serve one or more Regions. Each Region is assigned to a Region Server on startup. The master can decide to move a Region from one Region Server to another when load balancing is needed. The Master also handles Region Server failures by assigning the region to another Region Server.
The mapping of Regions to Region Servers is kept in a system table called META. By reading META, you can identify which region is responsible for your key. This means that for read and write operations the master is not involved at all, and clients can go directly to the Region Server responsible for serving the requested data.
To identify the Region Server, the client does a query on the META table.
META is a system table used to keep track of regions. It contains the server name and a region identifier comprising a table name and the start row-key. By looking at a region's start-key and the next region's start-key, clients can identify the range of rows contained in a particular region.
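As a rough illustration of that lookup path, here is a minimal sketch using the HBase client's RegionLocator to ask which Region Server currently serves a given key; the client library resolves this through META without contacting the HMaster. The table name "mytable" and the row key are placeholders, and an hbase-site.xml on the classpath is assumed:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

public class RegionLookup {
    public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml (ZooKeeper quorum, etc.) from the classpath
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             // "mytable" and the row key below are placeholders for this sketch
             RegionLocator locator = connection.getRegionLocator(TableName.valueOf("mytable"))) {
            // The client resolves the region via META behind the scenes;
            // the HMaster plays no part in this lookup.
            HRegionLocation location = locator.getRegionLocation(Bytes.toBytes("some-row-key"));
            System.out.println("Row is served by region server: "
                    + location.getHostname() + ":" + location.getPort());
        }
    }
}
```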
If you need low-latency, real-time access to data (less than 10 TB), use Spark. Hadoop is good for batch processing of large data sets. Spark + HBase is a good combination for your requirements.
Have a look at this Cloudera article regarding the HBase-Spark module.
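For a rough idea of what the combination looks like in code, here is a minimal sketch that counts the rows of an HBase table from Spark. It uses the generic TableInputFormat route rather than the hbase-spark module the article describes, and the table name "mytable" is a placeholder:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HBaseSparkRowCount {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("hbase-row-count");
        try (JavaSparkContext sc = new JavaSparkContext(sparkConf)) {
            Configuration hbaseConf = HBaseConfiguration.create();
            // "mytable" is a placeholder table name
            hbaseConf.set(TableInputFormat.INPUT_TABLE, "mytable");

            // Each HBase region becomes one Spark partition of (row key, Result) pairs
            JavaPairRDD<ImmutableBytesWritable, Result> rows =
                    sc.newAPIHadoopRDD(hbaseConf, TableInputFormat.class,
                            ImmutableBytesWritable.class, Result.class);

            System.out.println("Row count: " + rows.count());
        }
    }
}
```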
Upvotes: 1
Reputation: 7138
HDFS is a file system; you already know the features it provides. The question is where the NoSQL stuff comes in. It is for real-time processing, for formats that do not need to adhere to the relational model, and for other features that help us process large data.
There are various NoSQL databases. If you go with HBase, then you need HDFS, and that is where the name node and data nodes come in, because HBase works on top of HDFS.
If you choose Cassandra, for example, you don't need HDFS. Of course it supports Hadoop, but Hadoop is not required. Again, HBase and Cassandra are just the tip of the iceberg when it comes to NoSQL databases. You can get a full list of NoSQL databases here
Upvotes: 1
Reputation: 864
Overlapping capabilities of Hadoop and NoSQL
Both Hadoop and NoSQL are great for managing large and rapidly growing data sets. Both can handle a variety of data formats, including log files, documents and rich media. Both can also leverage commodity hardware and support horizontal scaling. If you have structured data in which the structure differs between records, or if the structure is likely to change in the future, then both NoSQL and Hadoop are appropriate technologies for your use case.
The two technologies are intended for different types of workloads
NoSQL is meant for real-time access, covering both reads and writes, while Hadoop is about large-scale data processing.
Both can co-exist in an EDW
NoSQL and Hadoop can be deployed and work together in an enterprise data architecture. In a typical architecture, NoSQL is used for real-time and interactive data, while a Hadoop cluster is used for large-scale (batch-mode) data processing and analytics.
Role of HBase
HBase is a column-oriented NoSQL database that runs on top of Hadoop. It combines the scalability of Hadoop (by running on HDFS) with real-time data access as a key/value store and the deep analytic capabilities of MapReduce.
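To make the key/value side concrete, here is a minimal sketch of a random write and read through the HBase client API; the "users" table, its "info" column family, and the row key are assumptions made for this example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseKeyValueExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             // A "users" table with an "info" column family is assumed to exist
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write: a single row keyed by a user id
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read it back by key: a low-latency random read, no MapReduce job involved
            Get get = new Get(Bytes.toBytes("user-42"));
            Result result = table.get(get);
            String name = Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name")));
            System.out.println("name = " + name);
        }
    }
}
```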
Upvotes: 2
Reputation: 9569
HDFS is a distributed file system (DFS). It allows you to view the disks of multiple machines (data nodes) as one huge single disk. In order to read and write files in HDFS you need a client application that talks over the network to the name node (for metadata about files and directories) and to the data nodes (for the actual file data).
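As a small illustration, here is a minimal sketch of such a client using the Hadoop FileSystem API; the path "/tmp/hello.txt" is just a placeholder, and the name node address is assumed to come from the core-site.xml/hdfs-site.xml files on the classpath:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS from the classpath configuration points the client at the name node
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/hello.txt"); // placeholder path

        // Write: the client asks the name node where to place blocks,
        // then streams the bytes directly to the data nodes
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: block locations come from the name node,
        // the actual bytes come from the data nodes
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }
    }
}
```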
HBase is a distributed key-value store that uses HDFS to store its data. To get the best performance out of HBase you need to run the HBase nodes on the same servers as the HDFS data nodes, to avoid unnecessary network calls.
Upvotes: 1