user1405023

Reputation:

Hadoop and HBase

Hi, I am new to HBase and Hadoop. I couldn't find out why we use Hadoop with HBase. I know Hadoop is a file system, but I have read that we can use HBase without Hadoop, so why do we use Hadoop?
Thanks

Upvotes: 10

Views: 22585

Answers (8)

Tariq

Reputation: 34184

Hadoop is a platform that allows us to store and process large volumes of data across clusters of machines in parallel. It is a batch processing system where we don't have to worry about the internals of data storage or processing.

It provides not only HDFS, the distributed file system for reliable data storage, but also a processing framework, MapReduce, that allows huge data sets to be processed across clusters of machines in parallel.
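
To make the MapReduce side concrete, here is a minimal sketch of the classic word-count job using the standard org.apache.hadoop.mapreduce Java API. Treat it as an illustration rather than a tuned production job; the input and output paths are supplied as arguments.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: emits (word, 1) for every word in its input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              context.write(word, ONE);
            }
          }
        }
      }

      // Reducer: sums the per-word counts emitted by the mappers.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }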

One of the biggest advantages of Hadoop is that it provides data locality. Moving data that huge is costly, so Hadoop moves the computation to the data instead. Both HDFS and MapReduce are highly optimized to work with really large data.

HDFS assures high availability and failover through data replication, so that if any one of the machines in your cluster goes down because of some catastrophe, your data is still safe and available.

HBase, on the other hand, is a NoSQL database. We can think of it as a distributed, scalable, big data store. It is used to overcome a pitfall of HDFS: the inability to do random reads and writes.

HBase is a suitable choice if we need random, real-time read/write access to our data. It was modeled after Google's BigTable, while HDFS was modeled after GFS (the Google File System).
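
As a small sketch of what that random, real-time access looks like with the HBase Java client (the table name "mytable" and column family "cf" are made-up examples, and the HTable-style calls shown here are from the classic client API):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RandomAccessExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        HTable table = new HTable(conf, "mytable");       // assumes the table exists
        try {
          // Random write: update one row without rewriting anything else.
          Put put = new Put(Bytes.toBytes("row-42"));
          put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("hello"));
          table.put(put);

          // Random read: fetch exactly that row back, in real time.
          Result result = table.get(new Get(Bytes.toBytes("row-42")));
          byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));
          System.out.println(Bytes.toString(value)); // prints "hello"
        } finally {
          table.close();
        }
      }
    }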

It is not necessary to use HBase on top of HDFS only. We can use HBase with other persistent stores like S3 or EBS. If you want to know about Hadoop and HBase in detail, you can visit the respective home pages: hadoop.apache.org and hbase.apache.org.

You can also go through the books "Hadoop: The Definitive Guide" and "HBase: The Definitive Guide" if you want to learn in depth.

Upvotes: 12

khan

Reputation: 2674

The Hadoop distributed file system, HDFS, does several jobs for us. We can't really say Hadoop is only a file system; it also provides the resources for distributed processing, through a master-slave architecture with which we can easily manage our data.
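
For illustration, here is a small sketch of using HDFS through Hadoop's Java FileSystem API (the hdfs://namenode:9000 address and the file path are made-up placeholders): the client asks the master (NameNode) where blocks live, while the data itself flows to and from the slaves (DataNodes).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder NameNode URI
        FileSystem fs = FileSystem.get(conf);

        // Write: the NameNode assigns DataNodes; blocks are replicated for us.
        Path path = new Path("/user/demo/hello.txt");
        FSDataOutputStream out = fs.create(path);
        out.writeUTF("hello hdfs");
        out.close();

        // Read: blocks come back from whichever DataNodes hold the replicas.
        FSDataInputStream in = fs.open(path);
        System.out.println(in.readUTF());
        in.close();
      }
    }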

As far as HBase is concerned, you can't connect remotely to HBase without using HDFS, because HBase can't form clusters on its own; by itself it only has its local file system.

I think you should see this link for a good intro to Hadoop!

Upvotes: 10

Tanveer Dayan

Reputation: 506

Hadoop comprises two main components:

  1. HDFS.
  2. Map-Reduce.

Both are explained below:

  1. HDFS is a file system that provides reliable storage with high fault tolerance (using replication) by distributing the data across a set of nodes. It consists of two components: the NameNode, where the metadata about the file system is stored, and the DataNodes (there can be many), where the actual distributed data is stored.

  2. Map-Reduce is a set of two types of Java daemons, the "JobTracker" and the "TaskTracker". The JobTracker daemon governs the jobs to be executed, whereas the TaskTracker daemons run on top of the DataNodes across which the data is distributed, so that they can execute the user's program logic on the data within their corresponding DataNode.

Therefore, to summarize, HDFS is the storage component and Map-Reduce is the execution component.

HBase, on the other hand, again comprises two components:

  1. HMaster - which, again, holds the metadata.

  2. RegionServers - another set of daemons, running on top of the DataNodes in the HDFS cluster, that store and serve the database data in HDFS (we store it in HDFS to exploit HDFS's core functionality: data replication and fault tolerance).

The difference between the Map-Reduce daemons and the HBase RegionServer daemons, both of which run on top of HDFS, is that the Map-Reduce daemons only perform Map-Reduce (aggregation) jobs, whereas the HBase RegionServer daemons perform the database functionalities like reads, writes, etc.
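
A sketch of how the two kinds of daemons meet in practice: HBase ships a helper, TableMapReduceUtil, that wires a Map-Reduce (aggregation) job to read rows served by the RegionServers (the table name "mytable" is a made-up example, and the no-op output format simply discards the counts):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class HBaseRowCounter {

      // Each map() call receives one HBase row, served by a RegionServer.
      static class RowMapper extends TableMapper<Text, IntWritable> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
            throws java.io.IOException, InterruptedException {
          context.write(new Text("rows"), new IntWritable(1)); // count every row
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(HBaseConfiguration.create(), "hbase row counter");
        job.setJarByClass(HBaseRowCounter.class);
        // RegionServers serve the scan; the Map-Reduce daemons run the aggregation.
        TableMapReduceUtil.initTableMapperJob(
            "mytable", new Scan(), RowMapper.class,
            Text.class, IntWritable.class, job);
        job.setNumReduceTasks(0);                 // map-only illustration
        job.setOutputFormatClass(NullOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }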

Upvotes: 1

Diego Pino

Reputation: 11586

There's little to add to what has already been said. Hadoop is a distributed file system (HDFS) plus MapReduce (a framework for distributed computing). HBase is a key-value data store built on top of Hadoop (meaning on top of HDFS).

The main reason to use HBase instead of plain Hadoop is to get random reads and writes. If you are using plain Hadoop, you have to read the whole dataset whenever you want to run a MapReduce job.
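
For example, a bounded Scan touches only the rows in a key range rather than the whole dataset (the table "logs" and the date-based row keys are made up; classic HTable client API):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RangeScanExample {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "logs");
        try {
          // Row keys are kept sorted, so this reads one day's rows,
          // not the whole table.
          Scan scan = new Scan(Bytes.toBytes("2012-06-01"),
                               Bytes.toBytes("2012-06-02"));
          ResultScanner scanner = table.getScanner(scan);
          try {
            for (Result row : scanner) {
              System.out.println(Bytes.toString(row.getRow()));
            }
          } finally {
            scanner.close();
          }
        } finally {
          table.close();
        }
      }
    }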

I also find it useful to import data into HBase when I'm working with thousands of small files.

I recommend this talk by Todd Lipcon (Cloudera): "Apache HBase: an introduction" http://www.slideshare.net/cloudera/chicago-data-summit-apache-hbase-an-introduction

Upvotes: 4

Roger

Reputation: 2953

One thing you should keep in mind: full ACID properties are not yet supported by HBase; HBase does support atomicity at the ROW LEVEL. You should try reading about its MVCC implementation.
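
As a sketch of that row-level atomicity (the table "accounts" and column family "cf" are made-up examples; classic HTable API), the client offers per-row check-and-put and atomic counters:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AtomicRowExample {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "accounts");
        try {
          byte[] row = Bytes.toBytes("user-1");
          byte[] cf = Bytes.toBytes("cf");

          // Atomic within one row: the put happens only if "status" still equals "open".
          Put put = new Put(row);
          put.add(cf, Bytes.toBytes("status"), Bytes.toBytes("closed"));
          boolean applied = table.checkAndPut(
              row, cf, Bytes.toBytes("status"), Bytes.toBytes("open"), put);
          System.out.println("applied: " + applied);

          // Atomic counters are also per-row operations.
          long hits = table.incrementColumnValue(row, cf, Bytes.toBytes("hits"), 1L);
          System.out.println("hits: " + hits);
        } finally {
          table.close();
        }
      }
    }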

Also, read about LSM trees vs. the B+ trees used in an RDBMS.

Upvotes: 1

Horse Voice

Reputation: 8348

It's for the sole purpose of distribution and speed of reads. What happens in HBase is that the data gets automatically "sharded" (partitioned), driven by your rowkey assignment. It's important to pick intelligent rowkeys because they get sorted in binary order. Keep in mind that the "sharded" subsets of data get split onto things called region servers. There can be multiple region servers on each machine in your cluster. If you don't distribute your data on a multi-node Hadoop cluster, you won't be able to utilize the processing power of multiple machines searching in parallel on their respective subsets of the data to return results to your querying client application. Hope this helps.
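
A sketch of the "intelligent rowkey" point: since keys are stored in sorted order, monotonically increasing keys (timestamps, say) pile onto one region server, so a common trick is to prefix the natural key with a hash-derived salt (the bucket count of 16 below is an arbitrary example):

    import org.apache.hadoop.hbase.util.Bytes;

    public class SaltedRowKey {
      private static final int SALT_BUCKETS = 16; // arbitrary example bucket count

      // Prefix a natural key with a salt derived from its hash, so consecutive
      // keys land on different regions instead of hot-spotting one server.
      static byte[] saltedKey(String naturalKey) {
        int salt = Math.abs(naturalKey.hashCode() % SALT_BUCKETS);
        return Bytes.toBytes(String.format("%02d-%s", salt, naturalKey));
      }

      public static void main(String[] args) {
        // Sequential timestamps spread over the key space instead of clustering.
        System.out.println(Bytes.toString(saltedKey("2012-06-25T10:00:00")));
        System.out.println(Bytes.toString(saltedKey("2012-06-25T10:00:01")));
      }
    }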

Upvotes: 1

David Gruzman

Reputation: 8088

I would try to put the terms in a stricter order.
Hadoop is a set of integrated technologies. The most notable parts are:
HDFS - a distributed file system built specially for massive data processing
MapReduce - a framework implementing the MapReduce paradigm over distributed file systems, of which HDFS is one. It can work over other DFSs - for example, Amazon S3.
HBase - a distributed, sorted key-value map built on top of a DFS. To the best of my knowledge, HDFS is the only DFS implementation compatible with HBase, because HBase needs an append capability to write its write-ahead log, and a DFS over Amazon's S3, for example, does not support it.
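
A rough sketch of the append capability in question, using the HDFS Java API (the path is made up; the durable-flush call was sync() in older Hadoop releases and hflush() in newer ones):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WalStyleAppend {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path("/hbase/wal-example.log"));
        try {
          out.writeBytes("put row-42 cf:col=hello\n"); // one log edit
          // Flush to the DataNodes so the edit survives a crash; a write-ahead
          // log is only useful if the file system can guarantee this.
          out.hflush();
        } finally {
          out.close();
        }
      }
    }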

Upvotes: 2

Jacob Groundwater

Reputation: 6671

HBase can be used without Hadoop. Running HBase in standalone mode will use the local file system.
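
A quick sketch of what standalone mode amounts to: hbase.rootdir points at a local directory instead of HDFS (the path below is a made-up example; normally this setting would live in hbase-site.xml):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class StandaloneHBase {
      public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Standalone mode: table data lives on the local disk, no HDFS involved.
        conf.set("hbase.rootdir", "file:///tmp/hbase-data");
        System.out.println("HBase root: " + conf.get("hbase.rootdir"));
      }
    }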

Hadoop is just a distributed file system with redundancy and the ability to scale to very large sizes. The reason arbitrary databases cannot be run on Hadoop is because HDFS is an append-only file system, and not POSIX compliant. Most SQL databases require the ability to seek and modify existing files.

HBase was designed with HDFS limitations in mind. CouchDB could in theory be ported to run on HDFS, because it also uses an append-only file format.

Upvotes: 2
