Rattle

Reputation: 9

How do traditional search engines like Lucene handle tokenization and indexing, and why don't they use integer mappings for tokens?

I've been learning about how traditional search engines like Lucene work, and I understand that they typically build an inverted index by tokenizing the text in the corpus. These tokens are then used directly in the index.

My question is: why don't these search engines convert all tokens to unique integers (e.g., apple -> 435, super -> 653, etc.) before building the inverted index? There is only a limited set of words in the English language, say 1 million. It seems like using integers instead of text tokens could reduce the index size, shrink the stored corpus (integers instead of words), and speed up searches, since handling numeric data should be faster.
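To make the idea concrete, here is a minimal sketch of the mapping I have in mind (toy documents and names are my own, not anything Lucene actually does):

```python
# Assign each distinct token an integer ID, then build the inverted
# index over IDs instead of strings. Documents here are made up.
from collections import defaultdict

def tokenize(text):
    return text.lower().split()

docs = {
    1: "apple pie is super",
    2: "super apple super juice",
}

token_to_id = {}                    # e.g. {"apple": 0, "pie": 1, ...}
inverted_index = defaultdict(set)   # token ID -> set of doc IDs

for doc_id, text in docs.items():
    for token in tokenize(text):
        term_id = token_to_id.setdefault(token, len(token_to_id))
        inverted_index[term_id].add(doc_id)

# Query side: look up the integer ID, then the postings list.
query = "apple"
postings = inverted_index.get(token_to_id.get(query, -1), set())
print(sorted(postings))   # -> [1, 2]
```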

Specifically, I'm curious about:

  1. Compression Efficiency: Can numeric data be compressed as efficiently as text? Would there be significant gains in compression by using integers?
  2. Handling New Tokens: How are new tokens managed in the traditional method, and how would this process change if integers were used instead of text? I am assuming there would be no change.
  3. Impact on Ranking and Relevance Calculations: Would using integer tokens instead of text tokens affect the ranking and relevance calculations (e.g., TF-IDF, BM25)? My assumption is that, again, there would be no change (see the sketch after this list).
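For point 3, my reasoning is that scoring only needs term counts and document frequencies, so it shouldn't matter whether the dictionary keys are strings or integer IDs. A rough sketch with an illustrative TF-IDF formula and a made-up corpus:

```python
# TF-IDF computed purely from counts; swapping string keys for int IDs
# would not change any of the numbers below. Corpus is illustrative.
import math
from collections import Counter

docs = {
    1: ["apple", "pie", "apple"],
    2: ["super", "apple"],
}

N = len(docs)
df = Counter()                     # document frequency per term
for tokens in docs.values():
    df.update(set(tokens))

def tf_idf(term, doc_id):
    tf = docs[doc_id].count(term)
    idf = math.log(N / df[term])
    return tf * idf

print(tf_idf("pie", 1))            # same value regardless of key type
```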

I'd appreciate any insights into the trade-offs and considerations that have led to the use of text tokens over integers in systems like Lucene. Perhaps there are caveats, or the performance benefits I have in mind would turn out to be negligible?

Edit: Assume the mapping is a manageable size for a specific use case: every token ID fits in a 4-byte integer, say 4 million distinct tokens. Even counting misspelled words and other symbols, everything stays under this 4 million limit for the given corpus. What changes then? It seems like switching to numbers is a free lunch in this case.

Upvotes: 0

Views: 58

Answers (1)

Mathew

Reputation: 582

  1. Elasticsearch handles not only English words but also words in other languages. Even for English, irregular tokens must be handled, such as misspellings like aapple, which are still legal tokens. Considering all of this, if an integer mapping were used, the number of distinct IDs would become very large.
  2. In terms of compression efficiency, numbers are indeed better than words; for example, postings can be delta-encoded and bit-packed directly, as sketched below.
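A rough sketch of what delta encoding plus bit packing looks like on a postings list of ascending doc IDs (illustrative only, not Lucene's actual on-disk codec):

```python
# Because doc IDs in a postings list are sorted, the gaps between them
# are small numbers that need far fewer bits than the raw 32-bit IDs.

def delta_encode(doc_ids):
    prev, gaps = 0, []
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps

def bit_pack(gaps):
    # Pack every gap using the minimum bit width needed for the largest gap.
    width = max(gaps).bit_length()
    packed = 0
    for g in gaps:
        packed = (packed << width) | g
    return packed, width

postings = [3, 7, 8, 15, 16]
gaps = delta_encode(postings)      # [3, 4, 1, 7, 1] -> small numbers
packed, width = bit_pack(gaps)     # 3 bits per gap instead of 32
print(gaps, width)
```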

Upvotes: 0
