Reputation: 91
I am getting the errors below while shipping logs to Elasticsearch with Fluentd. I'm running Fluentd on Kubernetes for application logging; we handle about 100M log events (around 400 TPS). The cluster runs on AWS m6g.2xlarge instances (8 cores, 32 GB RAM) with 3 master and 20 data nodes. Under 200 TPS everything works fine, but above 200 TPS I get these errors, Kibana starts lagging, and data is lost in Elasticsearch.
ES version: 7.15.0
fluentd version: 1.12.4
My logging flow: Fluentd > ES > Kibana
Error Log:
2022-02-04 00:37:53 +0530 [warn]: #0 failed to flush the buffer. retry_time=3 next_retry_seconds=2022-02-04 00:37:56 +0530 chunk="5d721d8e59e44f5bbbf4aa5e267f7e3e" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"eslogging-prod.abc.com\", :port=>80, :scheme=>\"http\"}): [429] {\"error\":{\"root_cause\":[{\"type\":\"es_rejected_execution_exception\",\"reason\":\"rejected execution of coordinating operation [coordinating_and_primary_bytes=1621708431, replica_bytes=0, all_bytes=1621708431, coordinating_operation_bytes=48222318, max_coordinating_and_primary_bytes=1655072358]\"}],\"type\":\"es_rejected_execution_exception\",\"reason\":\"rejected execution of coordinating operation [coordinating_and_primary_bytes=1621708431, replica_bytes=0, all_bytes=1621708431, coordinating_operation_bytes=48222318, max_coordinating_and_primary_bytes=1655072358]\"},\"status\":429}
Config file:
<source>
  @type tail
  @id in_tail_container_logs
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    @type json
    time_key @timestamp
    time_format %Y-%m-%dT%H:%M:%S.%N%z
    keep_time_key true
  </parse>
</source>

<filter kubernetes.**>
  @type kubernetes_metadata
  skip_container_metadata "true"
</filter>

<filter kubernetes.var.log.containers.**prod**>
  @id filter_concat
  @type concat
  key log
  use_first_timestamp true
  multiline_end_regexp /\n$/
  separator ""
</filter>

<filter kubernetes.var.log.containers.**prod**>
  @type record_transformer
  <record>
    log_json ${record["log"]}
  </record>
  remove_keys $.kubernetes.pod_id,$.kubernetes.container_image
</filter>

<filter kubernetes.var.log.containers.**prod**>
  @type parser
  @log_level debug
  key_name log_json
  #reserve_time true
  reserve_data true
  remove_key_name_field true
  emit_invalid_record_to_error true
  <parse>
    @type json
  </parse>
</filter>

<match kubernetes.var.log.containers.**prod**>
  @type elasticsearch
  @log_level info
  include_tag_key true
  suppress_type_name true
  host "eslogging-prod.abc.com"
  port 80
  reload_connections false
  logstash_format true
  logstash_prefix ${$.kubernetes.labels.app}
  reconnect_on_error true
  num_threads 8
  request_timeout 2147483648
  compression_level best_compression
  compression gzip
  include_timestamp true
  utc_index false
  time_key_format "%Y-%m-%dT%H:%M:%S.%N%z"
  time_key time
  reload_on_failure true
  prefer_oj_serializer true
  bulk_message_request_threshold -1
  slow_flush_log_threshold 30.0
  log_es_400_reason true
  <buffer tag, $.kubernetes.labels.app>
    @type file
    path /var/log/fluentd-buffers/kubernetes-apps.system.buffer
    flush_mode interval
    retry_type exponential_backoff
    flush_thread_count 8
    flush_interval 5s
    retry_forever true
    retry_max_interval 30
    chunk_limit_size 200M
    queue_limit_length 512
    overflow_action throw_exception
  </buffer>
</match>
Upvotes: 0
Views: 10986
Reputation: 301
You are hitting Elasticsearch's indexing back-pressure. The 429 with es_rejected_execution_exception means the coordinating node rejected the bulk request because the in-flight indexing bytes (coordinating_and_primary_bytes in your log) were about to exceed max_coordinating_and_primary_bytes. Have a look at this article for hints: https://chenriang.me/elasticsearch-bulk-insert-rejection.html I suggest reducing the chunk size in Fluentd (your chunk_limit_size of 200M produces very large bulk requests) and increasing the retries so rejected chunks are re-sent instead of dropped.
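As a rough sketch (the values are illustrative, not taken from your setup, and should be tuned against your throughput and node heap), the <buffer> section of your match block could look something like this:

  <buffer tag, $.kubernetes.labels.app>
    @type file
    path /var/log/fluentd-buffers/kubernetes-apps.system.buffer
    flush_mode interval
    flush_interval 5s
    flush_thread_count 8
    # Smaller chunks produce smaller bulk requests, which are far less
    # likely to exceed the node's coordinating/primary indexing limit.
    chunk_limit_size 16M
    queue_limit_length 512
    # Keep retrying with exponential backoff so chunks rejected with a
    # 429 are re-sent later instead of being lost.
    retry_type exponential_backoff
    retry_forever true
    retry_max_interval 30
    overflow_action throw_exception
  </buffer>

With 200M chunks a single flush pushes a huge bulk request into the cluster; splitting the same data into smaller chunks spreads the indexing load over time and lets the back-pressure mechanism recover between requests.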
Upvotes: 0