Reputation: 945
I've been using Elastic Search (powered by Lucene underneath) and its awesome. Blazing fast no matter what I throw at it.
I want to know why its fast now though. I understand its using an inverted index, and I partially understand whats that is based on several articles I've found and a few good youtube video's explaining it, but why is this so much significantly faster than a binary tree in Mysql or Mongo for instance? I know its a somewhat apples to oranges comparison, but I haven't been able to find any really good explanations (like a side by side) of how an inverted index works compared to how a binary tree index would work.
The only thing I've gathered so far is that based on indexing time, an inverted index will always be faster since it doesn't have to rebalance the leafs of the tree (same for a fractal index).
Upvotes: 3
Views: 5325
Reputation: 9964
This is all about trade-offs.
Lucene's index is very different from a B-Tree and based on write-once segments. For every segment, there are a terms dictionary and postings lists. Looking up a term requires to read some information about the term in the terms dictionary and them jumping to the start of its postings list, where you will be able to read sequentially the list of documents that match this term. Lookups in the terms dictionary are based on a variant of binary search in order to be able to run range queries efficiently.
Every time you add documents, a new segment is created (this is why Lucene is much better at bulk updates), and when the number of segments reaches a configurable threshold (called the merge factor), some segments are merged together (based on a MergePolicy).
This design allows Lucene to run many queries efficiently, but on the other hand, Lucene is still not very good at atomic single-document updates (because every commit requires a new segment to be flushed and because you need to reindex the whole document, no matter how big you modification is). There is ongoing work related to individual field updates but this is still experimental.
Upvotes: 7