Kushal Bansal
Kushal Bansal

Reputation: 143

hash index vs inverted index

As I understand, hash/inverted index maps values/ words to the records/ documents respectively. However, insert complexity in hash index is low (as it adds a new bucket in case of overflow), but more in inverted index (due to maintaining sorted list of document IDs). Does it mean that they are essentially the same, except the implementation?

Upvotes: 3

Views: 3000

Answers (3)

Raymond Pang
Raymond Pang

Reputation: 797

Inverted index is just a concept (opposite to forward index). It is about whether having word or document as the key in a key-value pair.

And, hash table is just one of the implementations for indexing, alternative to tree, etc.

Therefore, you could implement inverted index with hash table. You could also implement inverted indexes with a tree instead.

insert complexity in hash index is low (as it adds a new bucket in case of overflow), but more in inverted index (due to maintaining sorted list of document IDs)

For inverted index, maintaining sorted list is optional. It is required only if the search engine support advanced features, such as searching for multiple consecutive words

The Sweet Apple is sweet. It grows on an apple tree. I love sweet apple.

Sweet: 1, 4, 13
Apple: 2, 9, 14


Query: "Sweet Apple"
Returns (1,2), (13,14)

Therefore, for a full-text search engine that supports consecutive word search, an inverted index implemented with hash table will still have to have the bucket entries sorted. So, implementing index tree with a hash table in this use case does not have any advantages.

Upvotes: 0

Gaslight Deceive Subvert
Gaslight Deceive Subvert

Reputation: 20354

An inverted index is a data structure that maps content (such as tokens) to their locations in documents. The major benefit of an inverted index is that the whole data collection doesn't have to be searched through to look up interesting documents.

Consider a book. The index at the end of it is an example of an inverted index. But it is not a hash index.

A hash index is an inverted index implemented using a hashtable. This image shows how they store data:

enter image description here

Other data structures can be used to implemented inverted indexes, such as trees. Binary trees can be used but are often too simplistic so rb-trees or b-trees are used instead. Tree-based indexes are a little harder to understand so pictures doesn't help much. They have properties that make them preferrable to hash-based indexes when you have large amounts of data, such as being easier to update and having better worst-case performance.

Upvotes: 2

Abhishek Jain
Abhishek Jain

Reputation: 4518

From what I understand, hash index is used for a completely different use-case/scenario in comparison to an inverted index. The hash index is just a mapping from an index key to the exact location of the given row in memory (primarily used for memory optimized tables in relational databases) whereas an inverted index is actually the mapping from a word to the documents in which it is contained.

So, if we look at it, a word could be contained in a number of documents and the documents would be shared by many such words. Hence many keys in case of an inverted index, point to document ids which are common across many such keys whereas in case of an hash index, the data which the keys point to i.e. the row data might be completely unrelated to each other.

Thus, they are not the same as they address completely unrelated scenarios and are implemented very differently.

For more info on inverted index, you might refer to the post here: BigData: Inverted Index

Upvotes: 2

Related Questions