Sra1

Reputation: 670

Documents are automatically getting deleted in Elasticsearch after insertion

I created an index in Elasticsearch with the following settings. After inserting data into the index using the Bulk API (sketched below), the docs.deleted count is continuously increasing. Does this mean the documents are automatically getting deleted, and if so, what did I do wrong?

PUT /inc_index/
{
  "mappings": {
    "store": {
      "properties": {
        "title": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store" : true,
          "index_analyzer" : "fulltext_analyzer"
         },
         "description": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store" : true,
          "index_analyzer" : "fulltext_analyzer"
        },
        "category": {
          "type": "string"
        }
      }
    }
  },
  "settings" : {
    "index" : {
      "number_of_shards" : 5,
      "number_of_replicas" : 1
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}
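
For reference, the bulk requests look roughly like this (a minimal sketch; the IDs and field values are invented for illustration):

POST /inc_index/store/_bulk
{ "index": { "_id": "1" } }
{ "title": "some title", "description": "some description", "category": "some category" }
{ "index": { "_id": "2" } }
{ "title": "another title", "description": "another description", "category": "another category" }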

The output of GET /_cat/indices?v is shown below; the docs.deleted count keeps increasing:

health status index    pri rep docs.count docs.deleted store.size pri.store.size  
green  open   inc_index  5   1   2009877       584438      6.8gb          3.6gb

Upvotes: 8

Views: 7123

Answers (3)

nonNumericalFloat

Reputation: 1777

This can happen if your machine is too slow

It can occur when the machine is too slow to handle the bulk insertion, for example because the documents are quite big or there are simply too many of them at once.

After slowing down the indexing process there was no document loss anymore. It is still strange that the documents that were not inserted showed up under "deleted", which suggests to me that they were indeed processed.

This occurred to me while using Elasticdump and was resolved by setting the --limit option to a lower number.
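
A minimal sketch of such an invocation, assuming a local cluster and a JSON dump file (both placeholders); --limit controls how many objects are transferred per batch:

elasticdump \
  --input=data.json \
  --output=http://localhost:9200/inc_index \
  --limit=50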

Upvotes: 2

Shweta

Reputation: 23

Elasticsearch indices are composed of "segments". Segments are write-once, so when we delete or update a document it is not actually removed; it is only marked as deleted, which increases the "docs.deleted" count.

More segments mean slower searches and more memory used, so Elasticsearch merges segments in the background: small segments are merged into bigger segments, which in turn are merged into even bigger segments. While merging, any documents that are marked as deleted are not copied into the bigger segment, and once the merge has finished, the old segments are deleted. That is why the "docs.deleted" value decreases again over time.
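
You can also trigger this cleanup yourself instead of waiting for a background merge (a sketch; the endpoint is _optimize on Elasticsearch 1.x and _forcemerge on 2.1 and later):

POST /inc_index/_forcemerge?only_expunge_deletes=true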

Upvotes: 2

Andrei Stefan

Reputation: 52368

If your bulk operations also include updates to existing documents (inserts to documents with the same ID), then this is normal. In Elasticsearch, an update is a combination of delete and insert operations: https://www.elastic.co/guide/en/elasticsearch/guide/current/update-doc.html

The deleted documents you see there are only marked as deleted. When Lucene segment merging happens, they are physically removed from disk.
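
A quick way to see this in action (a sketch; the index and type names are from the question, the ID and field values are invented):

PUT /inc_index/store/1
{ "title": "first version", "description": "first description", "category": "a" }

PUT /inc_index/store/1
{ "title": "second version", "description": "second description", "category": "a" }

GET /_cat/indices/inc_index?v

After the second PUT, docs.deleted goes up by one: the first version of the document is only marked as deleted until a segment merge physically removes it.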

Upvotes: 11
