Reputation: 143
As I understand, hash/inverted index maps values/ words to the records/ documents respectively. However, insert complexity in hash index is low (as it adds a new bucket in case of overflow), but more in inverted index (due to maintaining sorted list of document IDs). Does it mean that they are essentially the same, except the implementation?
Upvotes: 3
Views: 3000
Reputation: 797
Inverted index is just a concept (opposite to forward index). It is about whether having word or document as the key in a key-value pair.
And, hash table is just one of the implementations for indexing, alternative to tree, etc.
Therefore, you could implement inverted index with hash table. You could also implement inverted indexes with a tree instead.
insert complexity in hash index is low (as it adds a new bucket in case of overflow), but more in inverted index (due to maintaining sorted list of document IDs)
For inverted index, maintaining sorted list is optional. It is required only if the search engine support advanced features, such as searching for multiple consecutive words
The Sweet Apple is sweet. It grows on an apple tree. I love sweet apple.
Sweet: 1, 4, 13
Apple: 2, 9, 14
Query: "Sweet Apple"
Returns (1,2), (13,14)
Therefore, for a full-text search engine that supports consecutive word search, an inverted index implemented with hash table will still have to have the bucket entries sorted. So, implementing index tree with a hash table in this use case does not have any advantages.
Upvotes: 0
Reputation: 20354
An inverted index is a data structure that maps content (such as tokens) to their locations in documents. The major benefit of an inverted index is that the whole data collection doesn't have to be searched through to look up interesting documents.
Consider a book. The index at the end of it is an example of an inverted index. But it is not a hash index.
A hash index is an inverted index implemented using a hashtable. This image shows how they store data:
Other data structures can be used to implemented inverted indexes, such as trees. Binary trees can be used but are often too simplistic so rb-trees or b-trees are used instead. Tree-based indexes are a little harder to understand so pictures doesn't help much. They have properties that make them preferrable to hash-based indexes when you have large amounts of data, such as being easier to update and having better worst-case performance.
Upvotes: 2
Reputation: 4518
From what I understand, hash index is used for a completely different use-case/scenario in comparison to an inverted index. The hash index is just a mapping from an index key to the exact location of the given row in memory (primarily used for memory optimized tables in relational databases) whereas an inverted index is actually the mapping from a word to the documents in which it is contained.
So, if we look at it, a word could be contained in a number of documents and the documents would be shared by many such words. Hence many keys in case of an inverted index, point to document ids which are common across many such keys whereas in case of an hash index, the data which the keys point to i.e. the row data might be completely unrelated to each other.
Thus, they are not the same as they address completely unrelated scenarios and are implemented very differently.
For more info on inverted index, you might refer to the post here: BigData: Inverted Index
Upvotes: 2