Reputation: 33

Why there is Pig and Hive

I understood what are the components of Hadoop, but my question is: As an end user, how can I access a file in Hadoop without worrying about the data storage?

So when using Pig/Hive commands, should I worry if the data storage is HDFS or HBase?

Thank you

Upvotes: 0

Answers (4)

Shahir Ansari

Reputation: 1848

Pig is used when the data is unstructured and has no schema. Database recommended - HDFS.

Hive is used when the data is structured and has a schema available. Database recommended - Hbase.

Upvotes: 0

user3730028

Reputation: 230

Data in the Hadoop ecosystem needs to be stored in a distributed filesystem. HDFS is the most popular such filesystem.

But HDFS' value proposition is in offering very high sequential read and write (scan) throughput. What if you wanted fast random reads and writes ?

That's where HBase comes in. HBase sits on top of HDFS and enables fast random reads and writes.

But you store data to ask interesting questions about that data. That is where MapReduce comes in. You express your question in the MapReduce programming paradigm and it gets you the answer you need. But it's low-level and you need to be a programmer. Spark is an alternative to MapReduce - much better optimized for when you need to ask more sophisticated questions than MapReduce. Hive and Pig are higher-level abstractions than MapReduce. Hive let's you ask your question in SQL, and converts your SQL to MapReduce (or Spark) job. Although, with the growing popularity of Spark, you can skip Hive and use SparkSQL (Spark's Dataframe/Dataset APIs) which can also interpret SQL.

The difference between Hive and Pig is explained in this excellent post by Alan Gates (Pig project PMC member and author of Programming Pig).

Upvotes: 0

ikweesung

Reputation: 11

HDFS is a distributed file system just like fxm said.
Almost all of hadoop components built on HDFS.
HBase is a DB which store its data on distributed file system (hdfs, can be other fs).
Pig is a kind of programming language which will be generated to map reduce job.
hive is a kind of db built on HDFS, and its SQL will be generated to map reduce job.
Using udf of hive or pig, you can almost access any format data on hdfs.
excuse my poor English. :D

Upvotes: 0

merours

Reputation: 4106

First of all, HDFS is a file system and HBase a database so yes, you should take that into consideration, since you don't access them the same way.

Knowing that, Pig and Hive let you access the data much easier than in pure Java. For instance, Hive lets you query HBase in a close-to-SQL way.

In the same way, you can browse and manage files with pig almost like with a shell on a standart machine.

To conclude, you should not worry about how files are stored with Hadoop, but where they are stored (HDFS or HBase).

Upvotes: 1

Why there is Pig and Hive

Answers (4)

Related Questions