Reputation: 30495
I used to think that Hive was just a SQL-like programming language used to make writing MapReduce-type jobs easier (i.e., a SQL-like version of Pig/Pig Latin). I'm reading more about it now, though, and apparently it's actually a full data warehouse infrastructure.
Is one of these use cases more common? That is, is it primarily used for the data warehouse infrastructure it provides, or more for the SQL-like interface? Or are both aspects of equal utility and importance?
(I'm asking because I'm trying to figure out what parts of Hive I should focus on learning about.)
Upvotes: 1
Views: 848
Reputation: 6289
Hive doesn't support updates. In our implementation, we used plain MapReduce jobs to populate the data warehouse, and Hive to produce exports for further processing or for import into relational data warehouses. We also used it as an intermediary for a BI reporting tool.
Upvotes: 0
Reputation: 1447
That's exactly what I used to think too. After about a month of experience with Hive, I find that it's a great ETL tool... for a data warehouse later down the line.
Hive doesn't compare with MDX. Hive is very row-based and doesn't allow a lot of the messier operations that SQL or MDX (MultiDimensional eXpressions, a query language common in BI tools) are masters at.
We're using Hive as an ETL tool to integrate our different flat file data sources and reduce the amount of data we have to upload to a SQL-based data warehouse.
If that data only has a half-life of a couple of weeks, we can keep the size of our database relatively manageable and still reproduce the reports later from Hive.
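To make the "reduce before upload" idea concrete, here is a minimal Python sketch of the same pattern. It is not Hive and not our actual pipeline: it just merges rows from a couple of flat-file sources and totals them per key, roughly what a `GROUP BY` query in Hive would do, so that only the summary rows need to go into the SQL warehouse. The column names (`day`, `product`, `amount`), the sample data, and the `summarize` helper are all hypothetical.

```python
import csv
import io
from collections import defaultdict


def summarize(sources):
    """Merge rows from several CSV sources and total `amount` per (day, product)."""
    totals = defaultdict(float)
    for source in sources:
        for row in csv.DictReader(source):
            totals[(row["day"], row["product"])] += float(row["amount"])
    # Sort by key so the output order is deterministic.
    return [(day, product, total) for (day, product), total in sorted(totals.items())]


# Two hypothetical flat-file sources, inlined here so the sketch is self-contained.
web = io.StringIO(
    "day,product,amount\n"
    "2012-01-01,a,1.5\n"
    "2012-01-01,a,2.5\n"
    "2012-01-02,b,4\n"
)
store = io.StringIO(
    "day,product,amount\n"
    "2012-01-01,a,1\n"
    "2012-01-02,b,6\n"
)

summary = summarize([web, store])
for line in summary:
    print(line)
```

Five raw rows collapse into two summary rows here. The raw files stay where they are, so the detailed reports can be regenerated from them later if needed.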
Upvotes: 2