Reputation: 33
I understand what the components of Hadoop are, but my question is: as an end user, how can I access a file in Hadoop without worrying about the data storage?
So when using Pig/Hive commands, should I worry about whether the data storage is HDFS or HBase?
Thank you
Upvotes: 0
Views: 331
Reputation: 1848
Pig is typically used when the data is unstructured and has no schema. Recommended storage: HDFS.
Hive is typically used when the data is structured and has a schema available. Recommended storage: HBase.
Upvotes: 0
Reputation: 230
Data in the Hadoop ecosystem needs to be stored in a distributed filesystem. HDFS is the most popular such filesystem.
But HDFS's value proposition is very high sequential read and write (scan) throughput. What if you want fast random reads and writes?
That's where HBase comes in. HBase sits on top of HDFS and enables fast random reads and writes.
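The difference between the two access patterns can be illustrated with a toy sketch in plain Python. It does not use the HDFS or HBase APIs: the `records` list, `scan_lookup`, and `build_index` are hypothetical names standing in for "a file you have to scan" and "a store you can look up by key".

```python
def scan_lookup(records, wanted_key):
    """Scan-style access: read records in order until the key is found."""
    for key, value in records:
        if key == wanted_key:
            return value
    return None


def build_index(records):
    """Key-indexed access: build a key -> value map for direct lookups."""
    return dict(records)


# Hypothetical data set: 1000 (key, value) records in arrival order.
records = [(f"user{i}", i * 10) for i in range(1000)]
index = build_index(records)

if __name__ == "__main__":
    # Both return the same value, but the scan walks the whole list
    # while the indexed lookup goes straight to the key.
    print(scan_lookup(records, "user999"))
    print(index["user999"])
```

The scan is fine when you want to process every record anyway; the keyed lookup is what you want when you need a single record quickly.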
But you store data in order to ask interesting questions about it. That is where MapReduce comes in. You express your question in the MapReduce programming paradigm and it gets you the answer you need. But it's low-level and you need to be a programmer. Spark is an alternative to MapReduce that is much better optimized for more sophisticated questions than MapReduce handles well. Hive and Pig are higher-level abstractions than MapReduce. Hive lets you ask your question in SQL and converts your SQL into a MapReduce (or Spark) job. With the growing popularity of Spark, though, you can skip Hive and use Spark SQL (alongside Spark's DataFrame/Dataset APIs), which can also interpret SQL.
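To make the MapReduce paradigm concrete, here is a minimal word-count sketch in plain Python. It only imitates the map, shuffle, and reduce steps in a single process; it is not Hadoop's API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in order over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["Hive and Pig", "Pig and Spark"]
    print(word_count(sample))
```

In a real cluster the mappers and reducers run on many machines and the framework does the shuffle for you; a Hive query such as a `GROUP BY` count expresses roughly the same computation without you writing these functions by hand.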
The difference between Hive and Pig is explained in this excellent post by Alan Gates (Pig project PMC member and author of Programming Pig).
Upvotes: 0
Reputation: 11
Almost all Hadoop components are built on HDFS.
HBase is a database that stores its data on a distributed file system (HDFS, though it can be another file system).
Pig is a kind of programming language whose scripts are compiled into MapReduce jobs.
Hive is a kind of database built on HDFS, and its SQL is compiled into MapReduce jobs.
Using Hive or Pig UDFs, you can access data in almost any format on HDFS.
Excuse my poor English. :D
Upvotes: 0
Reputation: 4106
First of all, HDFS is a file system and HBase is a database, so yes, you should take that into consideration, since you don't access them the same way.
That said, Pig and Hive let you access the data much more easily than in pure Java. For instance, Hive lets you query HBase in a close-to-SQL way.
In the same way, you can browse and manage files with Pig almost as you would with a shell on a standard machine.
To conclude, you should not worry about how files are stored in Hadoop, but about where they are stored (HDFS or HBase).
Upvotes: 1