A few days ago our main Fluentd aggregator instance was re-created by mistake. As a result, our td-agent buffer instances lost their connection to it for ~12 hours and accumulated a lot of logs. After fixing the connectivity and restarting td-agent a few times, we got back to normal operation after about 2 days.
The problem is that a bunch of logs from those 2 days were never sent by the td-agent instance to the aggregator once it resumed working. That leaves a big gap in our tracking data for those 2 days. Here's our td-agent.conf:
<source>
# legacy trackevent schema: missing field1-field7
type tail
log_level error
path /var/log/php-fpm/track_access_hlog
pos_file /var/log/php-fpm/legacy-trackevent.pos
time_format %d/%b/%Y:%H:%M:%S
format /^\[(?<dateday>[^ ]+) .+\] (?:\S+) \/(?<event>(?!(.*PING)|(PUSH_ALL)|(PUSH_ARRIVAL)|(PUSH_MUTE)|(PUSH_NOT)|(PUSH_RECEIVED)|(PUSH_SENT)|(VIDEO.*))[^\/]+)\/(?:[^\/]*)\/(?<userid>[^\/]*)\/(?<session>[^\/]*)\/(?<broadcastid>[^\/]*)\/(?<doorid>[^\/]*)\/(?<userlevel>[^\/]*)\/(?<broadcastscount>[^\/]*)\/(?<unspentcoins>[^\/]*)\/(?:[^\/]*)\/(?<extradata>[^\/]*)\/(?<coins>[^\/]*)\/(?<points>[^\/]*)\/(?<platform>[^\/]*)\/(?<sourceid>[^\/]*)\/(?<domain>[^\/]*)\/trpxl\.gif.*$/
time_key dateday
tag tracklive.ip
</source>
<source>
# "current" trackevent schema: requires field1-field7
type tail
log_level error
path /var/log/php-fpm/track_access_hlog
pos_file /var/log/php-fpm/track-access.log.pos # This is where you record file position
time_format %d/%b/%Y:%H:%M:%S
format /^\[(?<dateday>[^ ]+) .+\] (?:\S+) \/(?<event>(?!(.*PING)|(PUSH_ALL)|(PUSH_ARRIVAL)|(PUSH_MUTE)|(PUSH_NOT)|(PUSH_RECEIVED)|(PUSH_SENT)|(VIDEO.*))[^\/]+)\/(?:[^\/]*)\/(?<userid>[^\/]*)\/(?<session>[^\/]*)\/(?<broadcastid>[^\/]*)\/(?<doorid>[^\/]*)\/(?<userlevel>[^\/]*)\/(?<broadcastscount>[^\/]*)\/(?<unspentcoins>[^\/]*)\/(?:[^\/]*)\/(?<extradata>[^\/]*)\/(?<coins>[^\/]*)\/(?<points>[^\/]*)\/(?<platform>[^\/]*)\/(?<sourceid>[^\/]*)\/(?<domain>[^\/]*)\/(?<field1>[^\/]*)\/(?<field2>[^\/]*)\/(?<field3>[^\/]*)\/(?<field4>[^\/]*)\/(?<field5>[^\/]*)\/(?<field6>[^\/]*)\/(?<field7>[^\/]*)\/trpxl\.gif.*$/
time_key dateday
tag tracklive.ip
</source>
<match tracklive.**>
@type forward
@id forward_output
phi_failure_detector false
send_timeout 10s
expire_dns_cache 60s
buffer_queue_limit 256
buffer_chunk_limit 16m
buffer_type file
buffer_path /var/log/td-agent/buffer/
<server>
name fluentd-aggregator-box
host fluentd.ourdomainxxxx.com
</server>
</match>
<source>
type tail
log_level error
format /^\[(?<dateday>[^ ]+) .+\][^\/]+\/(?<event>PUSH[^\/]*)\/[^\/]*\/(?<userid>[^\/]*)\/(?<session>[^\/]*)\/(?<broadcastid>[^\/]*)\/(?<doorid>[^\/]*)\/(?<userlevel>[^\/]*)\/[^\/]*\/(?<unspentcoins>[^\/]*)\/[^\/]*\/(?<push_origin>[^\/]*)\/[^\/]*\/[^\/]*\/(?<platform>[^\/]*)\/(?<sourceid>[^\/]*)\/(?<deviceid>[^\/]*)\/(?<push_type>[^\/]*)\/(?<channel_id>[^\/]*)\/(?<reserved128>[^\/]*)\/(?<reserved256>[^\/]*)\/(?<mobile_app_version>[^\/]*)\/(?<language>[^\/]*)\/(?<legacy_app_version>[^\/]*)\/trpxl.gif.*/
path /var/log/php-fpm/track_access_hlog
pos_file /var/log/php-fpm/push-event.pos
time_format %d/%b/%Y:%H:%M:%S # nginx default
tag push_events.ip
</source>
<match push_events.**>
type forward
phi_failure_detector false
send_timeout 36000s
expire_dns_cache 60s
buffer_queue_limit 1800
buffer_chunk_limit 20M
buffer_type file
buffer_path /var/log/td-agent/buffer_push_events/
<server>
host fluentd.ourdomainxxxx.com
</server>
</match>
Currently it's successfully reading from the source, creating buffer .log files (in the buffer_ping_events folder, for example), and sending them to our aggregator box. But the older files, from May 9th to May 11th, sit in the same folder and are not picked up or sent to the aggregator.
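To make "older files that are not picked up" concrete, a minimal sketch like the one below can list the chunk files that have not been touched for a while. The function name `list_stale_chunks` is made up for illustration, the `*.log` pattern is assumed from the buffer file names mentioned above, and `-mtime +N` follows `find`'s own rounding (whole 24-hour periods), so treat the cutoff as approximate.

```shell
#!/bin/sh
# list_stale_chunks DIR DAYS  (hypothetical helper, not part of td-agent)
# Prints the *.log files under DIR whose last modification is older than
# DAYS whole 24-hour periods, sorted by path. It only reads; nothing is
# moved, renamed, or deleted.
list_stale_chunks() {
    dir=$1
    days=$2
    find "$dir" -type f -name '*.log' -mtime "+$days" | sort
}
```

For example, `list_stale_chunks /var/log/td-agent/buffer_push_events/ 1` would print the chunks in that buffer directory that have been sitting unmodified for roughly two days or more.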
Considering that the buffer file names follow a hexadecimal sequence, if I rename the old files with a higher number, will td-agent pick them up and send them? If so, could that cause problems later on?
Thanks in advance
What I tried: changed some configuration and restarted td-agent, expecting all log files to be sent, but only newly generated ones go through. I also tried forcing a flush of the buffered files with the command:
kill -USR1 $(cat /var/run/td-agent/td-agent.pid)
To no avail.
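To rule out the signal simply never reaching the process (a stale pid file, for instance), the flush attempt can be wrapped in a small check. This is only a sketch: `flush_td_agent` is a hypothetical helper name, and the default pid file path is the one from the command above.

```shell
#!/bin/sh
# flush_td_agent [PIDFILE]  (hypothetical helper, not part of td-agent)
# Sends SIGUSR1 to the process named in PIDFILE, but only after checking
# that the pid file is readable and the process is actually running.
# Returns 1 if the pid file is unreadable, 2 if the process is not running.
flush_td_agent() {
    pidfile=${1:-/var/run/td-agent/td-agent.pid}
    if [ ! -r "$pidfile" ]; then
        echo "pid file not readable: $pidfile" >&2
        return 1
    fi
    pid=$(cat "$pidfile")
    # kill -0 sends no signal; it only tests that the pid exists and
    # that we are allowed to signal it.
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "no running process with pid '$pid'" >&2
        return 2
    fi
    kill -USR1 "$pid"
}
```

A non-zero return here would mean the earlier USR1 attempts never reached td-agent at all, which is a different problem from td-agent ignoring the old chunks.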