Awinash

Reputation: 91

Hadoop with Relational Database

I am new to Hadoop and would like to know how Hadoop works in the following scenario.

When building a dynamic web project, I used to store and retrieve data from a MySQL database by sending queries from Java/C#.

Now I want to use Hadoop services in my project. Does Hadoop provide any built-in database system where we can store data and retrieve it when required, instead of using an external database?

Thanks in advance.

Upvotes: 3

Views: 5120

Answers (3)

Remus Rusanu

Reputation: 294277

Hadoop jobs use an InputFormat to create the InputSplits. While most examples use file input, with HDFS fragments as the input splits, the concept is abstract and can be mapped to other inputs. A typical example is the already existing DataDrivenDBInputSplit, which represents a set of rows in a table. Input formats and input splits of this kind are what Apache Sqoop (a command-line tool offering several commands) uses to read database input.
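To make that concrete, here is a minimal sketch of a map-only job wired to DBInputFormat instead of files. Everything specific in it (the MySQL URL, the credentials, and the hypothetical "users" table with id/name columns) is made up for illustration:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
    import org.apache.hadoop.mapreduce.lib.db.DBWritable;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DbInputExample {

        // Value class mapping one row of the hypothetical "users" table;
        // DBInputFormat hands instances of it to the mappers.
        public static class UserRecord implements Writable, DBWritable {
            long id;
            String name;

            public void readFields(ResultSet rs) throws SQLException {
                id = rs.getLong("id");
                name = rs.getString("name");
            }
            public void write(PreparedStatement ps) throws SQLException {
                ps.setLong(1, id);
                ps.setString(2, name);
            }
            public void readFields(DataInput in) throws IOException {
                id = in.readLong();
                name = in.readUTF();
            }
            public void write(DataOutput out) throws IOException {
                out.writeLong(id);
                out.writeUTF(name);
            }
            @Override
            public String toString() {
                return id + "\t" + name;
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Connection details are placeholders.
            DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                    "jdbc:mysql://dbhost/mydb", "dbuser", "dbpass");

            Job job = Job.getInstance(conf, "db-input-example");
            job.setJarByClass(DbInputExample.class);

            // Read rows of the table instead of HDFS files; splits are
            // built over ranges of rows ordered by "id".
            job.setInputFormatClass(DBInputFormat.class);
            DBInputFormat.setInput(job, UserRecord.class,
                    "users", null /* conditions */, "id" /* order by */,
                    "id", "name");

            // Map-only job with the identity mapper: just dump the rows.
            job.setNumReduceTasks(0);
            FileOutputFormat.setOutputPath(job, new Path("/tmp/db-input-example"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }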

The gist of it is that it is absolutely possible to use a DB as input for your job, but you have to realize that what you will achieve is to unleash a cluster of compute nodes that slam your relational database with requests for ranges of rows. It is very likely that your back end won't handle the load or, at best, will handle it slowly. The power of Hadoop comes from integrating processing with streamlined local storage access, and you are explicitly asking to give that up.

So, if your goal is to move data between an RDBMS and HDFS, Sqoop has you covered.
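For example, a basic import of one table into HDFS might look like the following; the connection string, credentials, table name and target directory are placeholders:

    # 4 parallel map tasks will each pull a range of rows from the table
    sqoop import \
        --connect jdbc:mysql://dbhost/mydb \
        --username dbuser \
        --password dbpass \
        --table users \
        --target-dir /data/users \
        --num-mappers 4

(sqoop export does the reverse, pushing HDFS files back into a table.)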

Upvotes: 1

Tariq

Reputation: 34184

Hadoop doesn't provide any built-in DB. It is just two things:

  • A distributed FS (HDFS)
  • A distributed processing framework (MapReduce; I'll call it MR for short)

I'm assuming that you require very quick responses, since you are dealing with a web service. IMHO, Hadoop (HDFS, to be precise), or any other FS for that matter, won't be a suitable choice in such a scenario. The reason is that HDFS lacks random read/write capability, which is essential for any web project.

The same holds true for Hive. Although it manages data in a fashion similar to an RDBMS, it's actually not an RDBMS; the underlying storage is still HDFS files. Moreover, when you issue a Hive query to fetch results, the query first gets converted into an MR job and only then produces the result, which makes for slow responses.

Your safest bet would be to go with HBase. It is definitely a better choice when you need random, real-time read/write access to your data, as in your case. Although it's not part of the Hadoop platform itself, it was built from the ground up to be used with Hadoop. It works on top of your existing HDFS cluster and can be accessed directly through the HBase APIs (which fits your case) or through MR (not for real-time use; that fits when you need to batch-process huge amounts of data). It is easy to set up and use, and requires no additional infrastructure.
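To give a rough idea of the direct-API route, here is a minimal sketch using the classic HBase Java client. The "users" table, its "info" column family and the row/column names are made up for illustration, and the table is assumed to have been created beforehand (e.g. from the HBase shell):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            // Picks up hbase-site.xml from the classpath to find the cluster.
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "users");

            // Write one cell: row key "user1", column info:email.
            Put put = new Put(Bytes.toBytes("user1"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("email"),
                    Bytes.toBytes("user1@example.com"));
            table.put(put);

            // Random, real-time read of the same row -- the access pattern
            // plain HDFS files do not give you.
            Get get = new Get(Bytes.toBytes("user1"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"))));

            table.close();
        }
    }

Reads like this go through HBase's own storage rather than a MapReduce scan, so they come back in milliseconds, which is what a web front end needs.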

One important thing to note here is that HBase is a NoSQL DB and doesn't follow RDBMS conventions and terminology, so you might have to put a bit of extra work into your design initially.

Apart from HBase you have some other options as well, like Cassandra, which is also a NoSQL DB.

HTH

Upvotes: 6

Suvarna Pattayil

Reputation: 5239

Hadoop Core does not contain any database.

From the Hadoop Wiki

Databases are wonderful. Issue an SQL SELECT call against an indexed/tuned database and the response comes back in milliseconds. Want to change that data? SQL UPDATE and the change is in. Hadoop does not do this.

Hadoop stores data in files, and does not index them. If you want to find something, you have to run a MapReduce job going through all the data. This takes time, and means that you cannot directly use Hadoop as a substitute for a database. Where Hadoop works is where the data is too big for a database (i.e. you have reached the technical limits, not just that you don't want to pay for a database license). With very large datasets, the cost of regenerating indexes is so high you can't easily index changing data. With many machines trying to write to the database, you can't get locks on it. Here the idea of vaguely-related files in a distributed filesystem can work.

There is a high performance column-table database that runs on top of Hadoop HDFS: Apache HBase. This is a great place to keep the results extracted from your original data.

You can also use Apache Hive, which gives you the feel of a relational database like MySQL (although there are shortcomings). Behind the scenes it uses MapReduce to help you leverage Hadoop for processing Big Data. Please note that Hive is not designed for OLTP workloads and does not offer real-time queries or row-level updates.
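As a rough illustration of that "relational feel", a Java program can talk to HiveServer2 over JDBC much as it would to MySQL. The table name below is hypothetical, and the query is still compiled into MapReduce jobs underneath, so expect batch latency rather than an interactive response:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver; host, port and database are placeholders.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "hiveuser", "");

            Statement stmt = con.createStatement();
            // The "page_views" table is made up for the example; the GROUP BY
            // runs as MapReduce jobs on the cluster behind the scenes.
            ResultSet rs = stmt.executeQuery(
                    "SELECT page, COUNT(*) AS hits FROM page_views GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }

            rs.close();
            stmt.close();
            con.close();
        }
    }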

Upvotes: 1
