Jacket

Reputation: 874

Elasticsearch index size is 40% larger in 5.x than in 1.x

I have an old cluster running Elasticsearch 1.4.4. The cluster contains ~11 billion documents, and the total size of all primaries is around 4TB.

I am now in the process of upgrading to Elasticsearch 5.2.2, which of course means reindexing my data. I have a separate cluster where this is happening at the moment. I am reindexing from my source database, since I have _all and _source disabled on the original index.

I have now reindexed around 750 million documents and noticed that my new index size is already 350GB. I did some math and it looks like the index will grow to around 5.5TB when fully indexed. That's 1.5TB more than the 1.4.4 index. I wasn't expecting this. On the contrary, I was expecting a decrease in size, since I've removed several attributes. Is this normal, or did I do something wrong? Are there different default settings in 5.2.2 that could contribute to this growth?

1.4.4 index settings:

{
  "index": {
    "refresh_interval": "30s",
    "number_of_shards": "20",
    "creation_date": "1426251049131",
    "analysis": {
      "analyzer": {
        "default": {
          "filter": [
            "icu_folding",
            "icu_normalizer"
          ],
          "type": "custom",
          "tokenizer": "icu_tokenizer"
        }
      }
    },
    "uuid": "WdgnCLyITgmpb4DROegV3Q",
    "version": {
      "created": "1040499"
    },
    "number_of_replicas": "1"
  }
}

1.4.4 index mapping:

{
  "article": {
    "_source": {
      "enabled": false
    },
    "_all": {
      "enabled": false
    },
    "properties": {
      "date": {
        "format": "dateOptionalTime",
        "type": "date",
        "doc_values": true
      },
      "has_enclosures": {
        "type": "boolean"
      },
      "feed_subscribers": {
        "type": "integer",
        "doc_values": true
      },
      "feed_language": {
        "index": "not_analyzed",
        "type": "string"
      },
      "author": {
        "norms": {
          "enabled": false
        },
        "analyzer": "keyword",
        "type": "string"
      },
      "has_pictures": {
        "type": "boolean"
      },
      "title": {
        "norms": {
          "enabled": false
        },
        "type": "string"
      },
      "content": {
        "norms": {
          "enabled": false
        },
        "type": "string"
      },
      "has_video": {
        "type": "boolean"
      },
      "url": {
        "index": "not_analyzed",
        "type": "string"
      },
      "feed_canonical": {
        "type": "boolean"
      },
      "feed_id": {
        "type": "integer",
        "doc_values": true
      }
    }
  }
}

5.2.2 index settings:

{
  "articles": {
    "settings": {
      "index": {
        "refresh_interval": "-1",
        "number_of_shards": "40",
        "provided_name": "articles",
        "creation_date": "1489604158595",
        "analysis": {
          "analyzer": {
            "default": {
              "filter": [
                "icu_folding",
                "icu_normalizer"
              ],
              "type": "custom",
              "tokenizer": "icu_tokenizer"
            }
          }
        },
        "number_of_replicas": "0",
        "uuid": "LOeOcZb_TMCX6E_86uMyXQ",
        "version": {
          "created": "5020299"
        }
      }
    }
  }
}

5.2.2 index mapping:

{
  "articles": {
    "mappings": {
      "article": {
        "_all": {
          "enabled": false
        },
        "_source": {
          "enabled": false
        },
        "properties": {
          "author": {
            "type": "text",
            "norms": false,
            "analyzer": "keyword"
          },
          "content": {
            "type": "text",
            "norms": false
          },
          "date": {
            "type": "date"
          },
          "feed_canonical": {
            "type": "boolean"
          },
          "feed_id": {
            "type": "integer"
          },
          "feed_subscribers": {
            "type": "integer"
          },
          "title": {
            "type": "text",
            "norms": false
          },
          "url": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

Any help will be much appreciated since full reindexing on this cluster takes about 30 days... Thanks!

Upvotes: 2

Views: 1509

Answers (2)

Roman

Reputation: 2118

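My guess would be doc_values. Since Elasticsearch 2.0, doc_values are enabled by default on every field type except analyzed strings (text in 5.x). Your 1.4 mapping enabled them only on date, feed_subscribers and feed_id, so your 5.2 mapping also builds them for url and feed_canonical, and that consumes disk space.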
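To illustrate, here is a minimal sketch of a mapping that switches doc_values off for the fields that did not have them in the 1.4 index. It assumes you never sort, aggregate or run scripts on url and feed_canonical, and the index name articles_v2 is only a placeholder. The doc_values setting of a field cannot be changed on an existing index, so it has to be part of the mapping when the index is created.

```
# Hypothetical index name; only the two fields that gained doc_values are shown.
# Assumption: url and feed_canonical are only searched, never sorted or aggregated on.
PUT /articles_v2
{
  "mappings": {
    "article": {
      "_all": { "enabled": false },
      "_source": { "enabled": false },
      "properties": {
        "url": { "type": "keyword", "doc_values": false },
        "feed_canonical": { "type": "boolean", "doc_values": false }
      }
    }
  }
}
```

The remaining fields would keep the definitions from the 5.2.2 mapping in the question. Given that a full reindex takes weeks, it may be worth trying this on a small sample first to see how much space it actually saves.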

Upvotes: 1

Adonis

Reputation: 4818

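I see you have changed the refresh interval and set the number of replicas to 0. If you are using spinning disks, the following setting can also increase indexing speed (on 5.x, index-level settings like this one are no longer accepted in elasticsearch.yml and have to be applied per index through the settings API):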

index.merge.scheduler.max_thread_count: 1
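
As a sketch of how that setting could be applied on a 5.x cluster, where index-level settings are set per index rather than in elasticsearch.yml, the request below uses the index settings API. The index name articles is taken from the question; whether a single merge thread helps depends on the disks, so treat the value as a starting point.

```
# Assumption: spinning disks, and the index is named "articles" as in the question.
PUT /articles/_settings
{
  "index.merge.scheduler.max_thread_count": 1
}
```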

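If you don't care about searching yet, disabling store throttling could also help on clusters older than 5.0. Store throttling was removed in 5.0, so I would expect your ES5 cluster to reject the following setting as unrecognized: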

PUT /_cluster/settings
{
    "transient" : {
        "indices.store.throttle.type" : "none" 
    }
}

Be sure you have swapping disabled. How much memory is allocated to your nodes in the ES5 cluster? (You should give the heap about half of a node's total memory, capped just below 32 GB, above which the JVM can no longer use compressed object pointers.)

Also, this increase in size might be because Elasticsearch has not merged its segments yet: during heavy indexing, merges lag behind, and segments are merged later in a calmer period, which reduces the size on disk. As long as the reindexing is not over, it's a little early to judge the overall size of the new index.
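
One way to check this, sketched below with the index name articles from the question, is to look at the segment list and primary store size, and to force a merge once indexing has finished. The target of 1 segment per shard is only an example value. A force merge on shards of this size is expensive, so it is best run when nothing else is writing to the index.

```
# Inspect segment counts and sizes per shard.
GET /_cat/segments/articles?v

# Document count and primary store size for the index.
GET /_cat/indices/articles?v&h=index,docs.count,pri.store.size

# After reindexing is complete: merge each shard down to fewer segments.
# max_num_segments=1 is an example value, not a recommendation from the thread.
POST /articles/_forcemerge?max_num_segments=1
```

Comparing pri.store.size before and after the merge shows how much of the extra size was unmerged segments rather than a real difference between the two versions.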


Upvotes: 0
