Is Apache Hive used more for the programming language or for the data warehouse aspects?

Question

I used to think that Hive was just a SQL-like programming language used to make writing MapReduce-type jobs easier (i.e., a SQL-like version of Pig/Pig Latin). I'm reading more about it now, though, and apparently it's actually a full data warehouse infrastructure.

Is one of these use cases more common? That is, is it primarily used for the data warehouse infrastructure it provides, or more for the SQL-like interface? Or are both aspects of equal utility and importance?

(I'm asking because I'm trying to figure out what parts of Hive I should focus on learning about.)

batman · Accepted Answer

That's exactly what I used to think too. After about a month of working with Hive, I find it's a great ETL tool that feeds a data warehouse further down the line.

Hive doesn't compare with MDX. Hive is very row-based and doesn't allow a lot of the messier operations that SQL or MDX (Multidimensional Expressions, common in BI tools) handle well.

We're using Hive as an ETL tool to integrate our different flat file data sources and reduce the amount of data we have to upload to a SQL-based data warehouse.

If that data only has a half-life of a couple of weeks, we can keep our database at a manageable size and still reproduce older reports from Hive whenever we need them.
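To make the pattern concrete, here is a rough sketch of the "integrate flat files, then shrink them before loading" idea. It is not Hive code: Python's built-in sqlite3 stands in for Hive so the example runs anywhere, and the two file layouts, the `events` table, and the column names are all made up for illustration. In Hive, the summarizing step would be a similar `GROUP BY` query written in HiveQL over tables defined on the raw files.

```python
import csv
import io
import sqlite3

# Two hypothetical flat-file sources with different layouts (made-up data).
WEB_LOG = "2024-01-01,alice,3\n2024-01-01,bob,5\n2024-01-02,alice,2\n"
APP_LOG = "alice|2024-01-01|4\nbob|2024-01-02|1\n"


def load_events(conn):
    """Extract: parse both files into one common (day, user_id, hits) table."""
    conn.execute("CREATE TABLE events (day TEXT, user_id TEXT, hits INTEGER)")
    # Comma-separated source: day, user_id, hits
    for day, user_id, hits in csv.reader(io.StringIO(WEB_LOG)):
        conn.execute("INSERT INTO events VALUES (?, ?, ?)", (day, user_id, int(hits)))
    # Pipe-separated source with a different column order: user_id, day, hits
    for user_id, day, hits in csv.reader(io.StringIO(APP_LOG), delimiter="|"):
        conn.execute("INSERT INTO events VALUES (?, ?, ?)", (day, user_id, int(hits)))


def daily_summary(conn):
    """Transform: collapse raw events into one row per day."""
    return conn.execute(
        "SELECT day, SUM(hits) AS total_hits, COUNT(DISTINCT user_id) AS users "
        "FROM events GROUP BY day ORDER BY day"
    ).fetchall()


# sqlite3 is only a local stand-in for Hive in this sketch.
conn = sqlite3.connect(":memory:")
load_events(conn)
summary = daily_summary(conn)
raw_rows = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(raw_rows, "raw rows reduced to", len(summary), "summary rows")
```

Only the small `summary` result would be uploaded to the SQL-based warehouse; the raw rows stay behind so the reports can be rebuilt from them later.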
