Reputation: 1
About a month ago I installed the Ops Agent on a GCE VM with the MongoDB integration as an add-on. The /etc/google-cloud-ops-agent/config.yaml is configured like this:
metrics:
  receivers:
    mongodb:
      type: mongodb
      insecure: true
      password: **
      username: **
  service:
    pipelines:
      mongo:
        receivers: [mongodb]
logging:
  receivers:
    mongodb:
      type: mongodb
  service:
    pipelines:
      mongo:
        receivers: [mongodb]
Everything had been working fine (both metrics and logs) until about a week ago, when all the logs suddenly stopped being sent to Cloud Logging. The same thing happened to another VM in the same project yesterday. The monitoring agent, however, is still working fine. When I restart the Ops Agent, it makes several attempts before the logging component fails. Nothing unusual seems to have happened on the VM that could have triggered this failure.
I've already tried uninstalling and re-installing the Ops Agent, but that didn't help. I've run diagnose-agents.sh to get more insight, but even then I can't figure out why it won't start. The agent-info.txt captured the following (mildly redacted for privacy):
===============================================================================
METADATA
===============================================================================
Trying to connect to logging.googleapis.com ... Connected OK
Trying to connect to monitoring.googleapis.com ... Connected OK
Using service account *************@developer.gserviceaccount.com.
It looks like you are using a Google-managed service account.
===============================================================================
OPS AGENT
===============================================================================
===== VERSION =====
google-cloud-ops-agent 2.30.0~debian10 amd64 install ok installed
===== STATUS =====
● google-cloud-ops-agent-opentelemetry-collector.service - Google Cloud Ops Agent - Metrics Agent
Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service; static; vendor preset: enabled)
Active: active (running) since Wed 2023-04-19 10:39:51 UTC; 31min ago
Process: 19745 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=otel -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} (code=exited, status=0/SUCCESS)
Main PID: 19768 (otelopscol)
Tasks: 11 (limit: 4915)
Memory: 40.4M
CGroup: /system.slice/google-cloud-ops-agent-opentelemetry-collector.service
└─19768 /opt/google-cloud-ops-agent/subagents/opentelemetry-collector/otelopscol --config=/run/google-cloud-ops-agent-opentelemetry-collector/otel.yaml
Apr 19 11:01:55 ************** otelopscol[19768]: 2023-04-19T11:01:55.540Z warn internal/transaction.go:121 Failed to scrape Prometheus endpoint {"scrape_timestamp": 1681902115539, "target_labels": "{__name__=\"up\", instance=\"0.0.0.0:20202\", job=\"logging-collector\"}"}
Apr 19 11:02:55 ************** otelopscol[19768]: 2023-04-19T11:02:55.542Z warn internal/transaction.go:121 Failed to scrape Prometheus endpoint {"scrape_timestamp": 1681902175539, "target_labels": "{__name__=\"up\", instance=\"0.0.0.0:20202\", job=\"logging-collector\"}"}
Apr 19 11:03:55 ************** otelopscol[19768]: 2023-04-19T11:03:55.541Z warn internal/transaction.go:121 Failed to scrape Prometheus endpoint {"scrape_timestamp": 1681902235539, "target_labels": "{__name__=\"up\", instance=\"0.0.0.0:20202\", job=\"logging-collector\"}"}
Apr 19 11:04:55 ************** otelopscol[19768]: 2023-04-19T11:04:55.541Z warn internal/transaction.go:121 Failed to scrape Prometheus endpoint {"scrape_timestamp": 1681902295539, "target_labels": "{__name__=\"up\", instance=\"0.0.0.0:20202\", job=\"logging-collector\"}"}
Apr 19 11:05:55 ************** otelopscol[19768]: 2023-04-19T11:05:55.541Z warn internal/transaction.go:121 Failed to scrape Prometheus endpoint {"scrape_timestamp": 1681902355539, "target_labels": "{__name__=\"up\", instance=\"0.0.0.0:20202\", job=\"logging-collector\"}"}
Apr 19 11:06:55 ************** otelopscol[19768]: 2023-04-19T11:06:55.541Z warn internal/transaction.go:121 Failed to scrape Prometheus endpoint {"scrape_timestamp": 1681902415539, "target_labels": "{__name__=\"up\", instance=\"0.0.0.0:20202\", job=\"logging-collector\"}"}
Apr 19 11:07:55 ************** otelopscol[19768]: 2023-04-19T11:07:55.541Z warn internal/transaction.go:121 Failed to scrape Prometheus endpoint {"scrape_timestamp": 1681902475539, "target_labels": "{__name__=\"up\", instance=\"0.0.0.0:20202\", job=\"logging-collector\"}"}
Apr 19 11:08:55 ************** otelopscol[19768]: 2023-04-19T11:08:55.541Z warn internal/transaction.go:121 Failed to scrape Prometheus endpoint {"scrape_timestamp": 1681902535539, "target_labels": "{__name__=\"up\", instance=\"0.0.0.0:20202\", job=\"logging-collector\"}"}
Apr 19 11:09:55 ************** otelopscol[19768]: 2023-04-19T11:09:55.541Z warn internal/transaction.go:121 Failed to scrape Prometheus endpoint {"scrape_timestamp": 1681902595539, "target_labels": "{__name__=\"up\", instance=\"0.0.0.0:20202\", job=\"logging-collector\"}"}
Apr 19 11:10:55 ************** otelopscol[19768]: 2023-04-19T11:10:55.540Z warn internal/transaction.go:121 Failed to scrape Prometheus endpoint {"scrape_timestamp": 1681902655539, "target_labels": "{__name__=\"up\", instance=\"0.0.0.0:20202\", job=\"logging-collector\"}"}
● google-cloud-ops-agent.service - Google Cloud Ops Agent
Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent.service; enabled; vendor preset: enabled)
Active: active (exited) since Wed 2023-04-19 10:39:51 UTC; 31min ago
Process: 19722 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -in /etc/google-cloud-ops-agent/config.yaml (code=exited, status=0/SUCCESS)
Process: 19744 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
Main PID: 19744 (code=exited, status=0/SUCCESS)
Apr 19 10:39:50 ********** google_cloud_ops_agent_engine[19722]: default_pipeline:
Apr 19 10:39:50 ********** google_cloud_ops_agent_engine[19722]: receivers: [hostmetrics]
Apr 19 10:39:50 ********** google_cloud_ops_agent_engine[19722]: processors: [metrics_filter]
Apr 19 10:39:50 ********** google_cloud_ops_agent_engine[19722]: mongo:
Apr 19 10:39:50 ********** google_cloud_ops_agent_engine[19722]: receivers: [mongodb]
Apr 19 10:39:51 ********** google_cloud_ops_agent_engine[19722]: 2023/04/19 10:39:51 Ports Check - Result: PASS
Apr 19 10:39:51 ********** google_cloud_ops_agent_engine[19722]: 2023/04/19 10:39:51 Network Check - Result: PASS
Apr 19 10:39:51 ********** google_cloud_ops_agent_engine[19722]: 2023/04/19 10:39:51 API Check - Result: PASS
Apr 19 10:39:51 ********** google_cloud_ops_agent_engine[19722]: 2023/04/19 10:39:51 Startup checks finished
Apr 19 10:39:51 ********** systemd[1]: Started Google Cloud Ops Agent.
● google-cloud-ops-agent-diagnostics.service - Google Cloud Ops Agent - Diagnostics
Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-diagnostics.service; disabled; vendor preset: enabled)
Active: active (running) since Wed 2023-04-19 10:39:50 UTC; 31min ago
Main PID: 19721 (google_cloud_op)
Tasks: 9 (limit: 4915)
Memory: 14.7M
CGroup: /system.slice/google-cloud-ops-agent-diagnostics.service
└─19721 /opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_diagnostics -config /etc/google-cloud-ops-agent/config.yaml
Apr 19 10:39:50 ********* systemd[1]: Started Google Cloud Ops Agent - Diagnostics.
● google-cloud-ops-agent-fluent-bit.service - Google Cloud Ops Agent - Logging Agent
Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service; static; vendor preset: enabled)
Active: failed (Result: signal) since Wed 2023-04-19 11:05:32 UTC; 6min ago
Process: 27214 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=fluentbit -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} -state ${STATE_DIRECTORY} (code=exited, status=0/SUCCESS)
Process: 27223 ExecStart=/opt/google-cloud-ops-agent/subagents/fluent-bit/bin/fluent-bit --config ${RUNTIME_DIRECTORY}/fluent_bit_main.conf --parser ${RUNTIME_DIRECTORY}/fluent_bit_parser.conf --log_file ${LOGS_DIRECTORY}/logging-module.log --storage_path ${STATE_DIRECTORY}/buffers (code=killed, signal=ABRT)
Main PID: 27223 (code=killed, signal=ABRT)
Apr 19 11:05:32 ********* systemd[1]: google-cloud-ops-agent-fluent-bit.service: Service RestartSec=100ms expired, scheduling restart.
Apr 19 11:05:32 ********* systemd[1]: google-cloud-ops-agent-fluent-bit.service: Scheduled restart job, restart counter is at 29.
Apr 19 11:05:32 ********* systemd[1]: Stopped Google Cloud Ops Agent - Logging Agent.
Apr 19 11:05:32 ********* systemd[1]: google-cloud-ops-agent-fluent-bit.service: Start request repeated too quickly.
Apr 19 11:05:32 ********* systemd[1]: google-cloud-ops-agent-fluent-bit.service: Failed with result 'signal'.
Apr 19 11:05:32 ********* systemd[1]: Failed to start Google Cloud Ops Agent - Logging Agent.
===== FLUENT BIT METRICS =====
--------------------------------------------------------------------------------
Could not determine fluent-bit PID
===============================================================================
LOGGING AGENT
===============================================================================
NOT INSTALLED
===============================================================================
MONITORING AGENT
===============================================================================
NOT INSTALLED
The health checks seem to be OK:
2023/04/19 10:39:50 ports_check.go:111: listening to 0.0.0.0:20202:
2023/04/19 10:39:50 ports_check.go:111: listening to 0.0.0.0:20201:
2023/04/19 10:39:50 ports_check.go:111: listening to [::]:20201:
2023/04/19 10:39:50 healthchecks.go:78: Ports Check - Result: PASS
2023/04/19 10:39:50 network_check.go:77: Logging API response status: 200 OK
2023/04/19 10:39:50 network_check.go:80: Request to the Logging API was successful.
2023/04/19 10:39:50 network_check.go:77: Monitoring API response status: 200 OK
2023/04/19 10:39:50 network_check.go:80: Request to the Monitoring API was successful.
2023/04/19 10:39:51 network_check.go:77: Packages API response status: 200 OK
2023/04/19 10:39:51 network_check.go:80: Request to packages.cloud.google.com was successful.
2023/04/19 10:39:51 network_check.go:77: dl.google.com response status: 200 OK
2023/04/19 10:39:51 network_check.go:80: Request to dl.google.com was successful.
2023/04/19 10:39:51 network_check.go:77: GCE Metadata Server response status: 200 OK
2023/04/19 10:39:51 network_check.go:80: Request to the GCE Metadata server was successful.
2023/04/19 10:39:51 healthchecks.go:78: Network Check - Result: PASS
2023/04/19 10:39:51 api_check.go:114: logging client was created successfully
2023/04/19 10:39:51 api_check.go:146: monitoring client was created successfully
2023/04/19 10:39:51 healthchecks.go:78: API Check - Result: PASS
And the error log that the Ops Agent produces when trying to restart is the following:
[error] [output:stackdriver:stackdriver.0] cannot retrieve oauth2 token
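For what it's worth, the token the agent fails to retrieve can be fetched manually from the GCE metadata server (a sketch assuming the default service account; these are the standard metadata paths, not anything specific to the Ops Agent):

```shell
# Ask the metadata server for an OAuth2 access token for the VM's
# default service account -- the same mechanism the agent relies on.
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token"

# List the access scopes granted to the VM; a logging.write (or
# broader) scope must be present for the logging subagent to export.
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/scopes"
```

If the first call returns a token JSON and the scopes include logging, the auth path itself is presumably fine and the error is a symptom of the crash rather than its cause.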
Edit (added detailed info about scopes and roles): Judging by the error, it might be an authentication problem, but the service account the VM has been using since the start is its default one, which has the "Editor" role and should therefore grant the necessary permissions. It hasn't been modified and is still the one in use; nothing should have changed on that end. The VM also has the following access scopes (basically the default ones):

Stackdriver Logging API: Write Only
Stackdriver Monitoring API: Write Only
Google Storage: Read Write

Edit: I've noticed another error in the logs:
Apr 20 15:20:18 ******** fluent-bit[11081]: [2023/04/20 15:20:18] [engine] caught signal (SIGSEGV)
Apr 20 15:20:18 ******** fluent-bit[11081]: #0 0x7f5443844729 in __libc_malloc() at ???:0
Apr 20 15:20:18 ******** fluent-bit[11081]: #1 0x5596ea843d9e in ???() at ???:0
Apr 20 15:20:18 ******** fluent-bit[11081]: #2 0x5596ea843df3 in flb_sds_create_len() at ???:0
Apr 20 15:20:18 ******** fluent-bit[11081]: #3 0x5596ea8fbbd4 in ???() at ???:0
Apr 20 15:20:18 ******** fluent-bit[11081]: #4 0x5596ea900889 in ???() at ???:0
Apr 20 15:20:18 ******** fluent-bit[11081]: #5 0x5596eab448c6 in ???() at ???:0
Apr 20 15:20:18 ******** fluent-bit[11081]: #6 0xffffffffffffffff in ???() at ???:0
Apr 20 15:20:18 ******** systemd[1]: google-cloud-ops-agent-fluent-bit.service: Main process exited, code=killed, status=6/ABRT
Apr 20 15:20:18 ******** systemd[1]: google-cloud-ops-agent-fluent-bit.service: Failed with result 'signal'.
Seems like a memory allocation problem... but I still have no idea what's causing it.
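In case the SIGSEGV points to a bug in the fluent-bit build bundled with agent 2.30.0, one workaround I'm considering is pinning an older agent version (a sketch based on the documented install script; the version number below is just an example, not one I've confirmed works):

```shell
# Download Google's official Ops Agent repo/install script and
# install a specific older version instead of "latest".
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --also-install --version=2.29.0
```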
Can anyone point me in the right direction? (or any direction, at this point) I'm quite stuck with this. Thanks!
Upvotes: 0
Views: 487