Reputation: 874
I have an old cluster running Elasticsearch 1.4.4. It contains ~11 billion documents, and the size of all primaries is around 4TB.
I am now in the process of upgrading to Elasticsearch 5.2.2, which of course means reindexing my data. I have a separate cluster where this is happening at the moment. I am reindexing from my source database, since I have _all and _source disabled on the original index.
I have now reindexed around 750 million documents and noticed that my new index is already 350GB. I did some math, and it looks like the index will grow to around 5.5TB when fully indexed. That's 1.5TB more than the 1.4.4 index. I wasn't expecting this. On the contrary, I was expecting a decrease in size, since I've removed several attributes. Is this normal, or did I do something wrong? Are there different default settings in 5.2.2 that could contribute to this growth?
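The estimate is a straight linear extrapolation. Below is a minimal sketch using only the rounded figures quoted above; it assumes the index grows in proportion to the document count, which ignores segment merging and any variation in document size. With these rounded inputs it lands slightly below 5.5TB, but still well above the old 4TB.

```python
# Linear extrapolation of the final index size from the rounded figures in the post.
docs_indexed = 750_000_000      # documents reindexed so far (approximate)
size_so_far_gb = 350            # size of the new index at that point, in GB
total_docs = 11_000_000_000     # documents in the old cluster (approximate)


def projected_size_gb(size_gb, indexed, total):
    """Scale the current size up to the full document count."""
    return size_gb * total / indexed


estimate_gb = projected_size_gb(size_so_far_gb, docs_indexed, total_docs)
print(f"projected size: {estimate_gb / 1000:.1f} TB")
```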
1.4.4 index settings:
{
  "index": {
    "refresh_interval": "30s",
    "number_of_shards": "20",
    "creation_date": "1426251049131",
    "analysis": {
      "analyzer": {
        "default": {
          "filter": [
            "icu_folding",
            "icu_normalizer"
          ],
          "type": "custom",
          "tokenizer": "icu_tokenizer"
        }
      }
    },
    "uuid": "WdgnCLyITgmpb4DROegV3Q",
    "version": {
      "created": "1040499"
    },
    "number_of_replicas": "1"
  }
}
1.4.4 index mapping:
{
  "article": {
    "_source": {
      "enabled": false
    },
    "_all": {
      "enabled": false
    },
    "properties": {
      "date": {
        "format": "dateOptionalTime",
        "type": "date",
        "doc_values": true
      },
      "has_enclosures": {
        "type": "boolean"
      },
      "feed_subscribers": {
        "type": "integer",
        "doc_values": true
      },
      "feed_language": {
        "index": "not_analyzed",
        "type": "string"
      },
      "author": {
        "norms": {
          "enabled": false
        },
        "analyzer": "keyword",
        "type": "string"
      },
      "has_pictures": {
        "type": "boolean"
      },
      "title": {
        "norms": {
          "enabled": false
        },
        "type": "string"
      },
      "content": {
        "norms": {
          "enabled": false
        },
        "type": "string"
      },
      "has_video": {
        "type": "boolean"
      },
      "url": {
        "index": "not_analyzed",
        "type": "string"
      },
      "feed_canonical": {
        "type": "boolean"
      },
      "feed_id": {
        "type": "integer",
        "doc_values": true
      }
    }
  }
}
5.2.2 index settings:
{
  "articles": {
    "settings": {
      "index": {
        "refresh_interval": "-1",
        "number_of_shards": "40",
        "provided_name": "articles",
        "creation_date": "1489604158595",
        "analysis": {
          "analyzer": {
            "default": {
              "filter": [
                "icu_folding",
                "icu_normalizer"
              ],
              "type": "custom",
              "tokenizer": "icu_tokenizer"
            }
          }
        },
        "number_of_replicas": "0",
        "uuid": "LOeOcZb_TMCX6E_86uMyXQ",
        "version": {
          "created": "5020299"
        }
      }
    }
  }
}
5.2.2 index mapping:
{
  "articles": {
    "mappings": {
      "article": {
        "_all": {
          "enabled": false
        },
        "_source": {
          "enabled": false
        },
        "properties": {
          "author": {
            "type": "text",
            "norms": false,
            "analyzer": "keyword"
          },
          "content": {
            "type": "text",
            "norms": false
          },
          "date": {
            "type": "date"
          },
          "feed_canonical": {
            "type": "boolean"
          },
          "feed_id": {
            "type": "integer"
          },
          "feed_subscribers": {
            "type": "integer"
          },
          "title": {
            "type": "text",
            "norms": false
          },
          "url": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
Any help will be much appreciated since full reindexing on this cluster takes about 30 days... Thanks!
Upvotes: 2
Views: 1509
Reputation: 2118
My guess would be doc_values. Since Elasticsearch 2.0, doc_values are enabled by default for every field type that supports them (essentially everything except analyzed text), so your 5.2 mapping builds doc_values for more fields than your 1.4 mapping did, and that consumes disk space.
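To make that concrete, here is a rough sketch that compares the two mappings from the question and lists the fields that gain doc_values under the 5.x defaults. The field definitions are copied from the question; the two helper functions and the `patched` mapping are my own illustration, not anything Elasticsearch provides.

```python
# Field definitions copied from the two mappings in the question
# (only the attributes relevant to doc_values are kept).
MAPPING_1_4 = {
    "date": {"type": "date", "doc_values": True},
    "has_enclosures": {"type": "boolean"},
    "feed_subscribers": {"type": "integer", "doc_values": True},
    "feed_language": {"type": "string", "index": "not_analyzed"},
    "author": {"type": "string", "analyzer": "keyword"},
    "has_pictures": {"type": "boolean"},
    "title": {"type": "string"},
    "content": {"type": "string"},
    "has_video": {"type": "boolean"},
    "url": {"type": "string", "index": "not_analyzed"},
    "feed_canonical": {"type": "boolean"},
    "feed_id": {"type": "integer", "doc_values": True},
}
MAPPING_5_2 = {
    "author": {"type": "text", "analyzer": "keyword"},
    "content": {"type": "text"},
    "date": {"type": "date"},
    "feed_canonical": {"type": "boolean"},
    "feed_id": {"type": "integer"},
    "feed_subscribers": {"type": "integer"},
    "title": {"type": "text"},
    "url": {"type": "keyword"},
}


def doc_values_fields_1x(mapping):
    # 1.x: doc_values are opt-in, so only an explicit "doc_values": true counts.
    return {name for name, f in mapping.items() if f.get("doc_values", False)}


def doc_values_fields_5x(mapping):
    # 5.x: on by default for every type except "text", unless switched off.
    return {name for name, f in mapping.items()
            if f["type"] != "text" and f.get("doc_values", True)}


gained = sorted(doc_values_fields_5x(MAPPING_5_2) - doc_values_fields_1x(MAPPING_1_4))
print(gained)  # ['feed_canonical', 'url']

# Hypothetical tweak: switch doc_values off again for the fields that gained them.
patched = {name: dict(f, doc_values=False) if name in gained else dict(f)
           for name, f in MAPPING_5_2.items()}
```

Of the two, url is the likely heavyweight, since it presumably holds a long, mostly unique string per document. Setting "doc_values": false only makes sense if you never sort, aggregate, or script on that field, and as far as I know it cannot be changed on an existing field, so it has to be in the mapping before you index.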
Upvotes: 1
Reputation: 4818
I see you have changed the refresh interval and set the number of replicas to 0. If you are using spinning disks, you can also add the following to elasticsearch.yml to increase indexing speed:
index.merge.scheduler.max_thread_count: 1
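One caveat for 5.x: as far as I know, most index-level settings are no longer accepted in elasticsearch.yml since 5.0 and have to be set on the index itself. Here is a minimal sketch of doing that through the index settings API, assuming a node reachable at http://localhost:9200 (a placeholder) and the articles index from the question. It only builds the request; the call that sends it is left commented out.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"   # assumption: address of a node in the new cluster
INDEX = "articles"                 # index name taken from the question


def merge_threads_request(max_threads=1):
    """Build a PUT /<index>/_settings request that limits merge threads."""
    body = {"index.merge.scheduler.max_thread_count": max_threads}
    return urllib.request.Request(
        url=f"{ES_URL}/{INDEX}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = merge_threads_request()
print(req.get_method(), req.full_url)
# To actually apply the setting: urllib.request.urlopen(req)
```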
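If you don't care about searching yet, the following on your ES5 cluster could also help (note that, as far as I know, the store throttling settings were removed in Elasticsearch 2.0, so a 5.x cluster may reject this request):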
PUT /_cluster/settings
{
  "transient": {
    "indices.store.throttle.type": "none"
  }
}
Be sure you have swapping disabled. How much memory is allocated to your nodes in the ES5 cluster? (You should give the heap half of a node's total available memory, capped at 32 GB because of the JVM's compressed object pointer limit.)
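As a quick illustration of that rule of thumb, here is a small sketch. The function name is made up, and the default cap of 31 GB is my own stand-in for "safely below 32 GB"; the exact threshold depends on the JVM.

```python
def recommended_heap_gb(total_ram_gb, cap_gb=31):
    """Half of the node's physical RAM, capped just under 32 GB."""
    return min(total_ram_gb // 2, cap_gb)


for ram in (16, 64, 128):
    print(ram, "GB RAM ->", recommended_heap_gb(ram), "GB heap")
```

The resulting value would go into -Xms and -Xmx (both set to the same size) in jvm.options on each node.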
Also, this increase in size might be because Elasticsearch does not merge its segments often during heavy indexing and waits for a calmer period to merge them, which then reduces the size on disk. As long as the reindexing is not finished, it's a little early to judge the overall size of the new index.
A few articles below that could help:
Upvotes: 0