Reputation: 368
I'm using logstash (v2.3.3-1) to index about 800k documents from S3 into AWS ElasticSearch, and some documents are being indexed 2 or 3 times instead of only once.
The files are static (nothing is updating or touching them) and they're very small (each is roughly 1.1KB).
The process takes a very long time to run on a t2.micro (~1 day).
The config I'm using is:
input {
  s3 {
    bucket            => "$BUCKETNAME"
    codec             => "json"
    region            => "$REGION"
    access_key_id     => '$KEY'
    secret_access_key => '$KEY'
    type              => 's3'
  }
}

filter {
  if [type] == "s3" {
    metrics {
      meter   => "events"
      add_tag => "metric"
    }
  }
}

output {
  if "metric" in [tags] {
    stdout {
      codec => line {
        format => "rate: %{[events][rate_1m]}"
      }
    }
  } else {
    amazon_es {
      hosts                 => [$HOST]
      region                => "$REGION"
      index                 => "$INDEXNAME"
      aws_access_key_id     => '$KEY'
      aws_secret_access_key => '$KEY'
      document_type         => "$TYPE"
    }
    stdout { codec => rubydebug }
  }
}
I've run this twice now (into different ES indices) with the same problem, and the files that get indexed more than once are different each time.
Any suggestions gratefully received!
Upvotes: 2
Views: 1026
Reputation: 16362
The s3 input is very fragile. It records the timestamp of the last file it processed, so any other files that share that timestamp will not be processed, and multiple logstash instances cannot read from the same bucket. As you've seen, it's also painfully slow at determining which files to process (a good portion of the blame goes to Amazon here).
The s3 input only works reliably for me when I use a single logstash instance to read the files, delete them (or back them up to another bucket/folder) to keep the original bucket as empty as possible, and set sincedb_path to /dev/null. A sketch of that setup is below.
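As a minimal sketch of that input (using the s3 plugin's backup_to_bucket, delete, and sincedb_path options; $BACKUP_BUCKETNAME and the credentials are placeholders, so adapt to your setup):

input {
  s3 {
    bucket            => "$BUCKETNAME"
    region            => "$REGION"
    codec             => "json"
    access_key_id     => '$KEY'
    secret_access_key => '$KEY'
    # Copy each file to another bucket once it has been read...
    backup_to_bucket  => "$BACKUP_BUCKETNAME"
    # ...then delete it from the source bucket so the listing stays small.
    delete            => true
    # Don't persist the "last processed" timestamp between runs.
    sincedb_path      => "/dev/null"
  }
}

With the source bucket kept near-empty, each run only sees new files, so the slow listing and the timestamp-based skipping largely go away.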
Upvotes: 1