I've been learning about how traditional search engines like Lucene work, and I understand that they typically build an inverted index by tokenizing the text in the corpus. These tokens are then used directly in the index.
My question is: why don't these search engines convert every token to a unique integer (e.g., apple -> 435, super -> 653, etc.) before building the inverted index? There is only a limited set of words in the English language, say one million. Using integers instead of text tokens could shrink the index, shrink the stored corpus (integers in place of words), and speed up searches, since comparing numeric keys should be faster than comparing strings.
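To make the idea concrete, here is a rough sketch of what I have in mind (a toy Java example with made-up names, not how Lucene is actually implemented):

```java
import java.util.*;

public class IntTokenIndex {
    // Assign each distinct token a dense integer ID, then build the
    // inverted index over IDs instead of strings.
    private final Map<String, Integer> tokenIds = new HashMap<>();
    private final Map<Integer, List<Integer>> postings = new HashMap<>();

    private int idFor(String token) {
        // The first sighting of a token assigns it the next free ID.
        return tokenIds.computeIfAbsent(token, t -> tokenIds.size());
    }

    public void addDocument(int docId, String text) {
        for (String token : text.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(idFor(token), k -> new ArrayList<>()).add(docId);
        }
    }

    public List<Integer> search(String token) {
        // Note that the query still has to pass through the string-to-ID lookup.
        Integer id = tokenIds.get(token);
        return id == null ? List.of() : postings.getOrDefault(id, List.of());
    }
}
```

Even in this sketch I can see that a query still pays for one string lookup in the token-to-ID map, so perhaps the savings are smaller than they first appear.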
Specifically, I'd appreciate any insights into the trade-offs and considerations that have led systems like Lucene to use text tokens rather than integers. Are there caveats, or would the performance benefits I am imagining turn out to be negligible?
Edit: Assume the mapping stays a manageable size for a specific use case: every token ID fits in a 4-byte integer, which allows roughly 4 billion distinct values. Even counting misspelled words and other symbols, all distinct tokens in a given corpus fall well under that limit. What changes then? In this case, switching to numbers seems like a free lunch.
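To put a rough number on the space intuition, here is a toy illustration (again Java, with made-up IDs; not how Lucene actually encodes its postings):

```java
import java.nio.ByteBuffer;

public class FixedWidthIds {
    public static void main(String[] args) {
        // Made-up token IDs (e.g., apple -> 435, super -> 653). A 4-byte int
        // distinguishes about 4.29 billion values, comfortably more than any
        // realistic vocabulary plus misspellings and stray symbols.
        int[] ids = {435, 653, 1_000_000};

        // Fixed-width encoding: every stored ID costs exactly Integer.BYTES bytes.
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES * ids.length);
        for (int id : ids) buf.putInt(id);

        System.out.println(buf.capacity() + " bytes for " + ids.length + " IDs");
        // Prints: 12 bytes for 3 IDs, versus the variable-length UTF-8 text
        // of the tokens themselves, which is the saving I am imagining.
    }
}
```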