user1189332

Reputation: 1941

Elasticsearch index cleanup

Elasticsearch version: 5.6.*

I'm looking for a way to implement a mechanism by which one of my indices (which grows large very quickly, at about 1 million documents per day) manages its storage constraints automatically.

For example: I would define the maximum number of documents or the maximum index size as a variable 'n', and write a scheduler that checks whether the 'n' threshold has been crossed. If it has, I'd want to delete the oldest 'x' documents (based on time).

I have a couple of questions here:

Obviously, I do not want to delete too many or too few documents. How would I know what 'x' is? Can I simply tell Elasticsearch, "Hey, delete the oldest documents worth 5GB"? My intent is simply to free up a fixed amount of storage. Is this possible?

Secondly, I'd like to know what the best practice is here. Obviously I don't want to reinvent the wheel, and if there's anything (e.g. Curator, which I've only heard about recently) that does the job, I'd be happy to use it.

Upvotes: 1

Views: 14840

Answers (3)

Kelby

Reputation: 61

I came up with a rather simple bash script solution to clean up time-based indices in Elasticsearch, which I thought I'd share in case anyone is interested. Curator seems to be the standard answer for doing this, but I really didn't want to install and manage a Python application with all the dependencies it requires. You can't get much simpler than a bash script executed via cron, and it doesn't have any dependencies outside of core Linux.

#!/bin/bash

# Make sure expected arguments were provided
if [ $# -lt 3 ]; then
    echo "Invalid number of arguments!"
    echo "This script is used to clean time based indices from Elasticsearch. The indices must have a"
    echo "trailing date in a format that can be represented by the UNIX date command such as '%Y-%m-%d'."
    echo ""
    echo "Usage: `basename $0` host_url index_prefix num_days_to_keep [date_format]"
    echo "The date_format argument is optional and defaults to '%Y-%m-%d'"
    echo "Example: `basename $0` http://localhost:9200 cflogs- 7"
    echo "Example: `basename $0` http://localhost:9200 elasticsearch_metrics- 31 %Y.%m.%d"
    exit 1
fi

elasticsearchUrl=$1
indexNamePrefix=$2
numDaysDataToKeep=$3
dateFormat=%Y-%m-%d
if [ $# -ge 4 ]; then
    dateFormat=$4
fi

# Get the current date in a 'seconds since epoch' format
curDateInSecondsSinceEpoch=$(date +%s)
#echo "curDateInSecondsSinceEpoch=$curDateInSecondsSinceEpoch"

# Subtract numDaysDataToKeep from current epoch value to get the last day to keep
let "targetDateInSecondsSinceEpoch=$curDateInSecondsSinceEpoch - ($numDaysDataToKeep * 86400)"
#echo "targetDateInSecondsSinceEpoch=$targetDateInSecondsSinceEpoch"

while : ; do
    # Subtract one day from the target date epoch
    let "targetDateInSecondsSinceEpoch=$targetDateInSecondsSinceEpoch - 86400"
    #echo "targetDateInSecondsSinceEpoch=$targetDateInSecondsSinceEpoch"

    # Convert targetDateInSecondsSinceEpoch into the configured date format
    targetDateString=$(date --date="@$targetDateInSecondsSinceEpoch" +"$dateFormat")
    #echo "targetDateString=$targetDateString"

    # Format the index name using the prefix and the calculated date string
    indexName="$indexNamePrefix$targetDateString"
    #echo "indexName=$indexName"

    # First check if an index with this date pattern exists
    # Curl options:
    #  -s  Silent mode. Don't show progress meter or error messages.
    #  -w "%{http_code}\n"  Causes curl to display only the HTTP status code after a completed transfer.
    #  -I  Fetch the headers only, using a HEAD request. There is no body, so this keeps curl from waiting on one.
    #  -o /dev/null  Prevents the response output from being displayed. This does not apply to the -w output, though.
    httpCode=$(curl -o /dev/null -s -w "%{http_code}\n" -I "$elasticsearchUrl/$indexName")
    #echo "httpCode=$httpCode"
    if [ $httpCode -ne 200 ]
    then
        echo "Index $indexName does not exist. Stopping processing."
        break
    fi

    # Send the command to Elasticsearch to delete the index. Save the HTTP return code in a variable.
    httpCode=$(curl -o /dev/null -s -w "%{http_code}\n" -X DELETE "$elasticsearchUrl/$indexName")
    #echo "httpCode=$httpCode"

    if [ $httpCode -eq 200 ]
    then
        echo "Successfully deleted index $indexName."
    else
        echo "FAILURE! Delete command failed with return code $httpCode. Continuing processing with next day."
        continue
    fi

    # Verify the index no longer exists. Should return 404 when the index isn't found.
    httpCode=$(curl -o /dev/null -s -w "%{http_code}\n" -I "$elasticsearchUrl/$indexName")
    #echo "httpCode=$httpCode"
    if [ $httpCode -eq 200 ]
    then
        echo "FAILURE! Delete command responded successfully, but index still exists. Continuing processing with next day."
        continue
    fi
done
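
For scheduling, a crontab entry along these lines would run the script nightly (the install path, log path, and index prefix are illustrative):

# Run at 01:30 every night, keeping 7 days of 'cflogs-' indices
30 1 * * * /usr/local/bin/clean_es_indices.sh http://localhost:9200 cflogs- 7 >> /var/log/es-index-cleanup.log 2>&1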

Upvotes: 2

Val

Reputation: 217274

In your case, the best practice is to work with time-based indices, either daily, weekly or monthly indices, whichever makes sense for the amount of data you have and the retention you want. You can also use the Rollover API in order to decide when a new index needs to be created (based on time, number of documents or index size).
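
For illustration, here is a minimal sketch of that flow (the index and alias names are made up, and I'm only showing the time and document-count conditions; the document count roughly matches your 1 million per day):

# Create the first index in the series with a write alias
# (names 'logs-000001' and 'logs_write' are illustrative)
curl -X PUT "http://localhost:9200/logs-000001" -H 'Content-Type: application/json' -d'
{
  "aliases": { "logs_write": {} }
}'

# Periodically ask Elasticsearch to roll over to a new index
# when either condition is met
curl -X POST "http://localhost:9200/logs_write/_rollover" -H 'Content-Type: application/json' -d'
{
  "conditions": {
    "max_age":  "1d",
    "max_docs": 1000000
  }
}'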

It is much easier to delete an entire index than delete documents matching certain conditions within an index. If you do the latter, the documents will be deleted but the space won't be freed until the underlying segments get merged. Whereas if you delete an entire time-based index, then you're guaranteed to free up space.
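
To make the contrast concrete, here is a sketch of both approaches (the index names and the '@timestamp' field are illustrative):

# Deleting a whole time-based index frees its disk space right away
curl -X DELETE "http://localhost:9200/logs-2018.01.01"

# Deleting by query only marks documents as deleted; the space is not
# reclaimed until the underlying segments get merged
curl -X POST "http://localhost:9200/logs/_delete_by_query" -H 'Content-Type: application/json' -d'
{
  "query": { "range": { "@timestamp": { "lt": "now-7d" } } }
}'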

Upvotes: 3

untergeek

Reputation: 863

I responded to the same question at https://discuss.elastic.co/t/elasticsearch-efficiently-cleaning-up-the-indices-to-save-space/137019

If your index is always growing, then deleting documents is not best practice. It sounds like you have time-series data. If true, then what you want is time-series indices, or better yet, rollover indices.
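
Since the question mentions Curator: with time-series indices in place, a minimal sketch of a Curator action file for this kind of retention would look like the following (the prefix, date format, and retention window are illustrative):

# actions.yml - a minimal Curator action file (illustrative values)
actions:
  1:
    action: delete_indices
    description: Delete 'cflogs-' indices whose name-embedded date is older than 7 days
    options:
      ignore_empty_list: True
    filters:
    - filtertype: pattern
      kind: prefix
      value: cflogs-
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y-%m-%d'
      unit: days
      unit_count: 7

You would then run 'curator --config config.yml actions.yml' on a schedule, typically from cron.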

5GB is also a rather small amount to be purging, as a single Elasticsearch shard can healthily grow to 20GB - 50GB in size. Are you storage constrained? How many nodes do you have?

Upvotes: 1
