user70192

Reputation: 14204

Elasticsearch - What is the indexing process?

I am working on a project that uses Elasticsearch. I have my core search UI working. I'm now looking to improve some things. In this process, I discovered that I do not really understand what happens during "indexing". I understand what an index is. I understand what a document is. I understand that indexing happens either a) when a document is added, b) when a document is updated, or c) when the refresh endpoint is called.

Still, I do not really understand the detail behind indexing. For example, does indexing happen if a document is removed? What really happens during indexing? I keep looking for some documentation that explains this. However, I'm not having any luck.

Can someone please explain what happens during indexing and possibly point out some documentation?

Thank you!

Upvotes: 4

Views: 2890

Answers (1)

Rahul

Reputation: 16335

Indexing is a large process with many steps. I will try to give a brief introduction to the major steps in the indexing process.

Making Text Searchable

Every word in a text field needs to be searchable, which means a single field must be able to index multiple values (the individual words).

The data structure that best supports the multiple-values-per-field requirement is the inverted index. The inverted index contains a sorted list of all of the unique values, or terms, that occur in any document and, for each term, a list of all the documents that contain it.
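As a rough illustration of the structure described above, here is a minimal Python sketch of an inverted index. The function name `build_inverted_index` and the lowercase-and-split tokenization are assumptions made for this example; Lucene's real analyzers and on-disk format are far more elaborate.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each unique term to the sorted list of document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive analysis (an assumption): lowercase and split on whitespace.
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sort both the terms and each posting list.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick brown dogs",
}
index = build_inverted_index(docs)
print(index["quick"])  # [1, 3]
print(index["the"])    # [1, 2]
```

Looking up a term is then a single dictionary access, which is what makes this structure a good fit for full-text search.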

Updating the Index

First of all, please note that the inverted index Lucene writes to disk is immutable: once written, it never changes.

Hence, for any create, update, or delete operation (CRUD minus the read), Lucene does not rewrite the whole inverted index. Instead, it adds new supplementary indices that reflect the more recent changes.

Indexing Process

  • New documents are collected in an in-memory indexing buffer.
  • Every so often, the buffer is committed:

    • A new segment—a supplementary inverted index—is written to disk.
    • A new commit point is written to disk, which includes the name of the new segment.
    • The disk is fsync’ed—all writes waiting in the filesystem cache are flushed to disk, to ensure that they have been physically written.
    • The new segment is opened, making the documents it contains visible to search.
  • The in-memory buffer is cleared, and is ready to accept new documents.
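The steps above can be sketched as a toy model in Python. This is only an illustration under stated assumptions: the class `ToyShard` and its methods are hypothetical names, segments are kept in memory instead of on disk, and the fsync step is represented by a comment only.

```python
class ToyShard:
    """Hypothetical model: an in-memory buffer plus a list of immutable segments."""

    def __init__(self):
        self.buffer = []        # in-memory indexing buffer
        self.segments = []      # (name, postings) pairs that are open for search
        self.commit_point = []  # segment names recorded by the last commit

    def index(self, doc_id, text):
        # New documents only land in the buffer; they are not searchable yet.
        self.buffer.append((doc_id, text))

    def commit(self):
        if not self.buffer:
            return
        # Build a new segment (a small inverted index) from the buffer.
        postings = {}
        for doc_id, text in self.buffer:
            for term in text.lower().split():
                postings.setdefault(term, set()).add(doc_id)
        name = f"segment_{len(self.segments)}"
        frozen = {term: frozenset(ids) for term, ids in postings.items()}
        # A real implementation would write the segment and fsync here.
        # Open the segment for search and record it in the commit point.
        self.segments.append((name, frozen))
        self.commit_point = [seg_name for seg_name, _ in self.segments]
        # Clear the buffer so it can accept new documents.
        self.buffer = []

    def search(self, term):
        # A search consults every open segment and unions the matches.
        hits = set()
        for _, postings in self.segments:
            hits |= postings.get(term.lower(), frozenset())
        return sorted(hits)


shard = ToyShard()
shard.index(1, "hello world")
before = shard.search("hello")  # document is still in the buffer
shard.commit()
after = shard.search("hello")   # document is now in an open segment
print(before, after)  # [] [1]
```

The point of the sketch is the ordering: a document becomes visible to search only once the segment holding it has been opened, and existing segments are never modified.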

What happens in case of Delete

Segments are immutable, so documents cannot be removed from older segments.

When a document is “deleted,” it is actually just marked as deleted in the .del file. A document that has been marked as deleted can still match a query, but it is removed from the results list before the final query results are returned.

When is it actually removed

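During segment merging, smaller segments are combined into a larger one. Documents that are marked as deleted are not copied into the new segment, so they are purged from the filesystem once the old segments are removed.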
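To make the delete and merge behaviour concrete, here is a small Python sketch. It is an assumption-laden model, not Lucene's implementation: segments are plain dictionaries of term to document IDs, the `deleted` set stands in for the .del file, and the function names `search` and `merge` are hypothetical.

```python
def search(segments, deleted, term):
    """Collect matches from every segment, then filter out deleted doc IDs."""
    matches = set()
    for postings in segments:
        matches |= postings.get(term, set())
    return sorted(matches - deleted)


def merge(segments, deleted):
    """Build one new segment from the old ones, skipping deleted documents."""
    merged = {}
    for postings in segments:
        for term, doc_ids in postings.items():
            live = doc_ids - deleted
            if live:
                merged.setdefault(term, set()).update(live)
    # The old segments and their deletion marks can now be discarded.
    return [merged], set()


segments = [
    {"quick": {1}, "fox": {1}},
    {"quick": {2}, "dog": {2}},
]
deleted = {1}  # "delete" document 1 by marking it; the segments are untouched

result = search(segments, deleted, "quick")
print(result)  # [2]

merged_segments, remaining_deleted = merge(segments, deleted)
print(merged_segments)
```

Before the merge, document 1 still sits in the first segment and is only filtered out at query time. After the merge, it no longer appears in any segment at all.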

References :

Elasticsearch Docs

Inverted Index

Lucene Talks

Upvotes: 4
