Reputation: 1
About a month ago I installed the Ops Agent on a GCE VM with the MongoDB integration as an add-on. The /etc/google-cloud-ops-agent/config.yaml is configured like this:
metrics:
  receivers:
    mongodb:
      type: mongodb
      insecure: true
      password: **
      username: **
  service:
    pipelines:
      mongo:
        receivers: [mongodb]
logging:
  receivers:
    mongodb:
      type: mongodb
  service:
    pipelines:
      mongo:
        receivers: [mongodb]
Everything had been working fine (both metrics and logs) until about a week ago, when all the logs suddenly stopped being sent to Cloud Logging. The same thing happened to another VM in the same project yesterday. The monitoring agent, however, is still working fine. When I restart the Ops Agent, it makes several attempts before the logging component fails. Nothing unusual seems to have happened on the VM that could have triggered this failure.
I've already tried uninstalling and re-installing the Ops Agent, but that didn't help. I've run diagnose-agents.sh to get more insight, but even then I can't figure out why it won't start. The agent-info.txt captured the following (mildly redacted for privacy):
===============================================================================
METADATA
===============================================================================
Trying to connect to logging.googleapis.com ... Connected OK
Trying to connect to monitoring.googleapis.com ... Connected OK
Using service account *************@developer.gserviceaccount.com.
It looks like you are using a Google-managed service account.
===============================================================================
OPS AGENT
===============================================================================
===== VERSION =====
google-cloud-ops-agent 2.30.0~debian10 amd64 install ok installed
===== STATUS =====
● google-cloud-ops-agent-opentelemetry-collector.service - Google Cloud Ops Agent - Metrics Agent
Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service; static; vendor preset: enabled)
Active: active (running) since Wed 2023-04-19 10:39:51 UTC; 31min ago
Process: 19745 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=otel -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} (code=exited, status=0/SUCCESS)
Main PID: 19768 (otelopscol)
Tasks: 11 (limit: 4915)
Memory: 40.4M
CGroup: /system.slice/google-cloud-ops-agent-opentelemetry-collector.service
└─19768 /opt/google-cloud-ops-agent/subagents/opentelemetry-collector/otelopscol --config=/run/google-cloud-ops-agent-opentelemetry-collector/otel.yaml
Apr 19 11:01:55 ************** otelopscol[19768]: 2023-04-19T11:01:55.540Z warn internal/transaction.go:121 Failed to scrape Prometheus endpoint {"scrape_timestamp": 1681902115539, "target_labels": "{__name__=\"up\", instance=\"0.0.0.0:20202\", job=\"logging-collector\"}"}
Apr 19 11:02:55 ************** otelopscol[19768]: 2023-04-19T11:02:55.542Z warn internal/transaction.go:121 Failed to scrape Prometheus endpoint {"scrape_timestamp": 1681902175539, "target_labels": "{__name__=\"up\", instance=\"0.0.0.0:20202\", job=\"logging-collector\"}"}
Apr 19 11:03:55 ************** otelopscol[19768]: 2023-04-19T11:03:55.541Z warn internal/transaction.go:121 Failed to scrape Prometheus endpoint {"scrape_timestamp": 1681902235539, "target_labels": "{__name__=\"up\", instance=\"0.0.0.0:20202\", job=\"logging-collector\"}"}
Apr 19 11:04:55 ************** otelopscol[19768]: 2023-04-19T11:04:55.541Z warn internal/transaction.go:121 Failed to scrape Prometheus endpoint {"scrape_timestamp": 1681902295539, "target_labels": "{__name__=\"up\", instance=\"0.0.0.0:20202\", job=\"logging-collector\"}"}
Apr 19 11:05:55 ************** otelopscol[19768]: 2023-04-19T11:05:55.541Z warn internal/transaction.go:121 Failed to scrape Prometheus endpoint {"scrape_timestamp": 1681902355539, "target_labels": "{__name__=\"up\", instance=\"0.0.0.0:20202\", job=\"logging-collector\"}"}
Apr 19 11:06:55 ************** otelopscol[19768]: 2023-04-19T11:06:55.541Z warn internal/transaction.go:121 Failed to scrape Prometheus endpoint {"scrape_timestamp": 1681902415539, "target_labels": "{__name__=\"up\", instance=\"0.0.0.0:20202\", job=\"logging-collector\"}"}
Apr 19 11:07:55 ************** otelopscol[19768]: 2023-04-19T11:07:55.541Z warn internal/transaction.go:121 Failed to scrape Prometheus endpoint {"scrape_timestamp": 1681902475539, "target_labels": "{__name__=\"up\", instance=\"0.0.0.0:20202\", job=\"logging-collector\"}"}
Apr 19 11:08:55 ************** otelopscol[19768]: 2023-04-19T11:08:55.541Z warn internal/transaction.go:121 Failed to scrape Prometheus endpoint {"scrape_timestamp": 1681902535539, "target_labels": "{__name__=\"up\", instance=\"0.0.0.0:20202\", job=\"logging-collector\"}"}
Apr 19 11:09:55 ************** otelopscol[19768]: 2023-04-19T11:09:55.541Z warn internal/transaction.go:121 Failed to scrape Prometheus endpoint {"scrape_timestamp": 1681902595539, "target_labels": "{__name__=\"up\", instance=\"0.0.0.0:20202\", job=\"logging-collector\"}"}
Apr 19 11:10:55 ************** otelopscol[19768]: 2023-04-19T11:10:55.540Z warn internal/transaction.go:121 Failed to scrape Prometheus endpoint {"scrape_timestamp": 1681902655539, "target_labels": "{__name__=\"up\", instance=\"0.0.0.0:20202\", job=\"logging-collector\"}"}
● google-cloud-ops-agent.service - Google Cloud Ops Agent
Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent.service; enabled; vendor preset: enabled)
Active: active (exited) since Wed 2023-04-19 10:39:51 UTC; 31min ago
Process: 19722 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -in /etc/google-cloud-ops-agent/config.yaml (code=exited, status=0/SUCCESS)
Process: 19744 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
Main PID: 19744 (code=exited, status=0/SUCCESS)
Apr 19 10:39:50 ********** google_cloud_ops_agent_engine[19722]: default_pipeline:
Apr 19 10:39:50 ********** google_cloud_ops_agent_engine[19722]: receivers: [hostmetrics]
Apr 19 10:39:50 ********** google_cloud_ops_agent_engine[19722]: processors: [metrics_filter]
Apr 19 10:39:50 ********** google_cloud_ops_agent_engine[19722]: mongo:
Apr 19 10:39:50 ********** google_cloud_ops_agent_engine[19722]: receivers: [mongodb]
Apr 19 10:39:51 ********** google_cloud_ops_agent_engine[19722]: 2023/04/19 10:39:51 Ports Check - Result: PASS
Apr 19 10:39:51 ********** google_cloud_ops_agent_engine[19722]: 2023/04/19 10:39:51 Network Check - Result: PASS
Apr 19 10:39:51 ********** google_cloud_ops_agent_engine[19722]: 2023/04/19 10:39:51 API Check - Result: PASS
Apr 19 10:39:51 ********** google_cloud_ops_agent_engine[19722]: 2023/04/19 10:39:51 Startup checks finished
Apr 19 10:39:51 ********** systemd[1]: Started Google Cloud Ops Agent.
● google-cloud-ops-agent-diagnostics.service - Google Cloud Ops Agent - Diagnostics
Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-diagnostics.service; disabled; vendor preset: enabled)
Active: active (running) since Wed 2023-04-19 10:39:50 UTC; 31min ago
Main PID: 19721 (google_cloud_op)
Tasks: 9 (limit: 4915)
Memory: 14.7M
CGroup: /system.slice/google-cloud-ops-agent-diagnostics.service
└─19721 /opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_diagnostics -config /etc/google-cloud-ops-agent/config.yaml
Apr 19 10:39:50 ********* systemd[1]: Started Google Cloud Ops Agent - Diagnostics.
● google-cloud-ops-agent-fluent-bit.service - Google Cloud Ops Agent - Logging Agent
Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service; static; vendor preset: enabled)
Active: failed (Result: signal) since Wed 2023-04-19 11:05:32 UTC; 6min ago
Process: 27214 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=fluentbit -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} -state ${STATE_DIRECTORY} (code=exited, status=0/SUCCESS)
Process: 27223 ExecStart=/opt/google-cloud-ops-agent/subagents/fluent-bit/bin/fluent-bit --config ${RUNTIME_DIRECTORY}/fluent_bit_main.conf --parser ${RUNTIME_DIRECTORY}/fluent_bit_parser.conf --log_file ${LOGS_DIRECTORY}/logging-module.log --storage_path ${STATE_DIRECTORY}/buffers (code=killed, signal=ABRT)
Main PID: 27223 (code=killed, signal=ABRT)
Apr 19 11:05:32 ********* systemd[1]: google-cloud-ops-agent-fluent-bit.service: Service RestartSec=100ms expired, scheduling restart.
Apr 19 11:05:32 ********* systemd[1]: google-cloud-ops-agent-fluent-bit.service: Scheduled restart job, restart counter is at 29.
Apr 19 11:05:32 ********* systemd[1]: Stopped Google Cloud Ops Agent - Logging Agent.
Apr 19 11:05:32 ********* systemd[1]: google-cloud-ops-agent-fluent-bit.service: Start request repeated too quickly.
Apr 19 11:05:32 ********* systemd[1]: google-cloud-ops-agent-fluent-bit.service: Failed with result 'signal'.
Apr 19 11:05:32 ********* systemd[1]: Failed to start Google Cloud Ops Agent - Logging Agent.
===== FLUENT BIT METRICS =====
--------------------------------------------------------------------------------
Could not determine fluent-bit PID
===============================================================================
LOGGING AGENT
===============================================================================
NOT INSTALLED
===============================================================================
MONITORING AGENT
===============================================================================
NOT INSTALLED
The health checks seem to be OK:
2023/04/19 10:39:50 ports_check.go:111: listening to 0.0.0.0:20202:
2023/04/19 10:39:50 ports_check.go:111: listening to 0.0.0.0:20201:
2023/04/19 10:39:50 ports_check.go:111: listening to [::]:20201:
2023/04/19 10:39:50 healthchecks.go:78: Ports Check - Result: PASS
2023/04/19 10:39:50 network_check.go:77: Logging API response status: 200 OK
2023/04/19 10:39:50 network_check.go:80: Request to the Logging API was successful.
2023/04/19 10:39:50 network_check.go:77: Monitoring API response status: 200 OK
2023/04/19 10:39:50 network_check.go:80: Request to the Monitoring API was successful.
2023/04/19 10:39:51 network_check.go:77: Packages API response status: 200 OK
2023/04/19 10:39:51 network_check.go:80: Request to packages.cloud.google.com was successful.
2023/04/19 10:39:51 network_check.go:77: dl.google.com response status: 200 OK
2023/04/19 10:39:51 network_check.go:80: Request to dl.google.com was successful.
2023/04/19 10:39:51 network_check.go:77: GCE Metadata Server response status: 200 OK
2023/04/19 10:39:51 network_check.go:80: Request to the GCE Metadata server was successful.
2023/04/19 10:39:51 healthchecks.go:78: Network Check - Result: PASS
2023/04/19 10:39:51 api_check.go:114: logging client was created successfully
2023/04/19 10:39:51 api_check.go:146: monitoring client was created successfully
2023/04/19 10:39:51 healthchecks.go:78: API Check - Result: PASS
And the error log that the Ops Agent produces when trying to restart is the following:
[error] [output:stackdriver:stackdriver.0] cannot retrieve oauth2 token
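For what it's worth, the token the agent fails to retrieve can be fetched manually from the GCE metadata server (a sketch assuming the default service account; these are the standard metadata paths, not anything specific to the Ops Agent):

```shell
# Ask the metadata server for an OAuth2 access token for the VM's
# default service account -- the same mechanism the agent relies on.
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token"

# List the access scopes granted to the VM; a logging.write (or
# broader) scope must be present for the logging subagent to export.
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/scopes"
```

If the first call returns a token JSON and the scopes include logging, the auth path itself is presumably fine and the error is a symptom of the crash rather than its cause.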
Edit (added detailed info about scopes and roles): Judging by the error, it might be an authentication problem, but the service account the VM has been using since the start is its default one, which has the "Editor" role and should therefore grant the necessary permissions. It hasn't been modified and is still the one in use; nothing should have changed on that end. The VM also has the following access scopes (basically the default ones):

Stackdriver Logging API: Write Only
Stackdriver Monitoring API: Write Only
Google Storage: Read Write

Edit: I've noticed another error in the logs:
Apr 20 15:20:18 ******** fluent-bit[11081]: [2023/04/20 15:20:18] [engine] caught signal (SIGSEGV)
Apr 20 15:20:18 ******** fluent-bit[11081]: #0 0x7f5443844729 in __libc_malloc() at ???:0
Apr 20 15:20:18 ******** fluent-bit[11081]: #1 0x5596ea843d9e in ???() at ???:0
Apr 20 15:20:18 ******** fluent-bit[11081]: #2 0x5596ea843df3 in flb_sds_create_len() at ???:0
Apr 20 15:20:18 ******** fluent-bit[11081]: #3 0x5596ea8fbbd4 in ???() at ???:0
Apr 20 15:20:18 ******** fluent-bit[11081]: #4 0x5596ea900889 in ???() at ???:0
Apr 20 15:20:18 ******** fluent-bit[11081]: #5 0x5596eab448c6 in ???() at ???:0
Apr 20 15:20:18 ******** fluent-bit[11081]: #6 0xffffffffffffffff in ???() at ???:0
Apr 20 15:20:18 ******** systemd[1]: google-cloud-ops-agent-fluent-bit.service: Main process exited, code=killed, status=6/ABRT
Apr 20 15:20:18 ******** systemd[1]: google-cloud-ops-agent-fluent-bit.service: Failed with result 'signal'.
Seems like a memory allocation problem... but I still have no idea what's causing it.
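In case the SIGSEGV points to a bug in the fluent-bit build bundled with agent 2.30.0, one workaround I'm considering is pinning an older agent version (a sketch based on the documented install script; the version number below is just an example, not one I've confirmed works):

```shell
# Download Google's official Ops Agent repo/install script and
# install a specific older version instead of "latest".
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --also-install --version=2.29.0
```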
Can anyone point me in the right direction? (or any direction, at this point) I'm quite stuck with this. Thanks!
Upvotes: 0
Views: 487