anusngh

Reputation: 97

Hive (Big Data): difference between bucketing and indexing

What is the main difference between bucketing and indexing of a table in Hive?

Upvotes: 2

Views: 2142

Answers (1)

dbustosp

Reputation: 4458

The main difference is the goal:

  • Indexing

The goal of Hive indexing is to speed up lookups on specific columns of a table. Without an index, a query with a predicate such as 'WHERE tab1.col1 = 10' loads the entire table or partition and processes every row. If an index exists on col1, only a portion of the file needs to be loaded and processed.

Indexes matter even more as tables grow very large, and Hive is typically used on large tables.
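To make the idea concrete, here is a minimal Python sketch of what an index buys you. It is a toy model, not how Hive stores or reads its indexes: the in-memory `rows` list stands in for a table file, list positions stand in for file offsets, and the helper names (`full_scan`, `build_index`, `index_lookup`) are made up for illustration.

```python
from collections import defaultdict

# Toy table: each row is a dict; the list position stands in for a file offset.
rows = [{"col1": i % 100, "payload": f"row-{i}"} for i in range(1000)]


def full_scan(table, value):
    """Without an index: examine every row, and report how many were read."""
    matches = [row for row in table if row["col1"] == value]
    return matches, len(table)


def build_index(table, column):
    """Map each distinct column value to the positions of the rows holding it."""
    index = defaultdict(list)
    for pos, row in enumerate(table):
        index[row[column]].append(pos)
    return index


def index_lookup(table, index, value):
    """With an index: read only the positions recorded for the value."""
    positions = index.get(value, [])
    return [table[pos] for pos in positions], len(positions)


col1_index = build_index(rows, "col1")
scan_matches, scan_reads = full_scan(rows, 10)
idx_matches, idx_reads = index_lookup(rows, col1_index, 10)
```

Both calls return the same rows, but `scan_reads` counts every row in the table while `idx_reads` counts only the rows that actually match. The price is building and maintaining the index whenever the data changes.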

  • Bucketing

Bucketing is mainly used to optimize joins. Rows are distributed into a fixed number of buckets according to a specific column, such as a 'key' or 'id', so records with the same 'key' always land in the same bucket. When both tables are bucketed on the join key, the join only has to match corresponding buckets, which makes it faster. You can think of it as a technique for decomposing data sets into more manageable parts. The linked article on 5 tips for efficient Hive queries covers bucketing as one of them.
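A rough Python sketch of why bucketing helps a join follows. It only models the idea and is not Hive's actual implementation: `NUM_BUCKETS`, the `users`/`orders` data and the helper names are hypothetical, and `key % num_buckets` is used as a stand-in for the hash function that assigns rows to buckets.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, chosen only for illustration


def bucket_of(key, num_buckets=NUM_BUCKETS):
    """Stand-in hash: an integer key modulo the number of buckets."""
    return key % num_buckets


def bucketize(rows, key_field, num_buckets=NUM_BUCKETS):
    """Distribute rows into buckets so equal keys share a bucket number."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row[key_field], num_buckets)].append(row)
    return buckets


def bucketed_join(left_buckets, right_buckets, key_field):
    """Join bucket i of the left table only against bucket i of the right."""
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        # Small per-bucket lookup table instead of one over the whole table.
        lookup = defaultdict(list)
        for right_row in right:
            lookup[right_row[key_field]].append(right_row)
        for left_row in left:
            for right_row in lookup.get(left_row[key_field], []):
                joined.append({**left_row, **right_row})
    return joined


users = [{"id": i, "name": f"user-{i}"} for i in range(8)]
orders = [{"id": i % 8, "amount": 10 * i} for i in range(12)]

result = bucketed_join(bucketize(users, "id"), bucketize(orders, "id"), "id")
```

Because both tables use the same key and the same number of buckets, matching rows can only be in buckets with the same number, so each bucket pair can be joined independently and no bucket is ever compared against an unrelated one.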

Upvotes: 2
