Reputation: 521
Our Set-up in Kubernetes is as follows:
Opentelemetry agent (both java and python used and metrics from both have the same issue)
Opentelemetry-collector - version 0.77.0 (we previously were on 0.72.0 and that had the same problem) It's using the otlp receiver, like this:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
It has the following processors configured:
processors:
memory_limiter:
limit_mib: 4000
spike_limit_mib: 800
check_interval: 1s
batch:
send_batch_size: 10000
send_batch_max_size: 11000
timeout: 10s
#Delete unnecessary attributes from our metrics
resource:
attributes:
- key: telemetry.sdk.name
action: delete
- key: telemetry.sdk.version
action: delete
- key: telemetry.sdk.language
action: delete
- key: telemetry.auto.version
action: delete
- key: container.id
action: delete
- key: process.command.args
action: delete
k8sattributes/default:
Previously we had none of these processors configured and the problem was exactly the same
the opentelemetry collector has the prometheus exporter set like this:
exporters:
prometheus:
endpoint: "0.0.0.0:9464"
resource_to_telemetry_conversion:
enabled: true
enable_open_metrics: true
namespace: prometheus
and then servicemonitors configured like this:
serviceMonitor:
enabled: true
metricsEndpoints:
- port: metrics
interval: 15s
prometheusMetricsEndpoints:
- port: app-metrics
interval: 15s
honorLabels: true
Then finally we have prometheus which scrapes for these metrics based on those servicemonitors and Grafana to visualise everything.
The problem we have is that for some of our metrics we randomly see that datapoints are missing... we usually get a datapoint every 15 seconds but sometimes it's just not there, here are some examples:
for the logback_events_total metric we see many empty datapoints:
and another example with http_server_requests_milliseconds_count
this time zoomed in a bit more to make it more clear showing points only, no lines:
in this one you can clearly see we get a point every 15 seconds but sometimes we miss one or two of these points.
I have looked at the metrics for machine_scrape_error and container_scrape_error and as you can see there are no dropped data points but more importantly no scrape errors being reported
This problem seems to only happen for Counter based metrics... for others we don't seem to get this problem at all (e.g jvm_threads_live). Our Opentelemetry collector pod is only using roughly 10% CPU and 20% memory Our prometheus is scaled to 6 pods with resource usage well under 50% of the requested amount.
I am running out of ideas about what could cause this, I understand that a few dropped scrapes is normal but this seems like more than would be acceptable. We have tried turning on debug logs but haven't found anything useful in the logs. I assume these data points could be going missing at different points along the way (agent, collector or prometheus). Does anyone have any advice about how to start tracking this down?
Adding another graph as it was mentioned in the comments:
scrape duration seconds plotted alongside another metric that was dropping data points:
scrape duration seconds is the green line with no gaps
Edit: 30th May 2023 Still having the same issue, after enabling error logs on the otel collector we are seeing lines like these:
* collected metric prometheus_http_server_duration_milliseconds label:<name:"framework" value:"spring" > label:<name:"host_arch" value:"amd64" > label:<name:"host_name" value:"service-name-bf77bcc49-9dmh9" > label:<name:"http_method" value:"GET" > label:<name:"http_route" value:"/v1/vins/licenseplate/{licensePlateValue}/country/{country}/latest" > label:<name:"http_scheme" value:"http" > label:<name:"http_status_code" value:"204" > label:<name:"job" value:"service-name" > label:<name:"k8s_deployment_name" value:"service-name" > label:<name:"k8s_namespace_name" value:"service-name" > label:<name:"k8s_node_name" value:"ip-10-11-22-89.eu-central-1.compute.internal" > label:<name:"k8s_pod_ip" value:"10.11.20.233" > label:<name:"k8s_pod_name" value:"service-name-bf77bcc49-9dmh9" > label:<name:"k8s_pod_start_time" value:"2023-05-30 12:02:56 +0000 UTC" > label:<name:"k8s_pod_uid" value:"f3b6ef2e-31f5-4f4e-871a-ed34acfc247a" > label:<name:"label_team" value:"team-name" > label:<name:"net_host_name" value:"service-name-svc.service-name" > label:<name:"net_protocol_name" value:"http" > label:<name:"net_protocol_version" value:"1.1" > label:<name:"os_description" value:"Linux 5.15.90" > label:<name:"os_type" value:"linux" > label:<name:"process_command_args" value:"[\"/opt/java/openjdk/bin/java\",\"-XX:+UseSerialGC\",\"-Dopentelemetry.environment=development\",\"-javaagent:/opentelemetry/opentelemetry.jar\",\"-jar\",\"/app/service-name-fa541655.jar\"]" > label:<name:"process_executable_path" value:"/opt/java/openjdk/bin/java" > label:<name:"process_pid" value:"1" > label:<name:"process_runtime_description" value:"Eclipse Adoptium OpenJDK 64-Bit Server VM 19.0.2+7" > label:<name:"process_runtime_name" value:"OpenJDK Runtime Environment" > label:<name:"process_runtime_version" value:"19.0.2+7" > label:<name:"service_name" value:"service-name" > histogram:<sample_count:12 sample_sum:3068.996335 bucket:<cumulative_count:0 upper_bound:0 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:11 upper_bound:10 > bucket:<cumulative_count:11 upper_bound:25 > bucket:<cumulative_count:11 upper_bound:50 > bucket:<cumulative_count:11 upper_bound:75 > bucket:<cumulative_count:11 upper_bound:100 > bucket:<cumulative_count:11 upper_bound:250 > bucket:<cumulative_count:11 upper_bound:500 > bucket:<cumulative_count:11 upper_bound:750 > bucket:<cumulative_count:11 upper_bound:1000 > bucket:<cumulative_count:11 upper_bound:2500 > bucket:<cumulative_count:12 upper_bound:5000 > bucket:<cumulative_count:12 upper_bound:7500 > bucket:<cumulative_count:12 upper_bound:10000 > > has help "The duration of the inbound HTTP request" but should have "measures the duration of the inbound HTTP request"
* collected metric prometheus_jvm_threads_states label:<name:"host_arch" value:"amd64" > label:<name:"host_name" value:"5b8bfdc684-29xd5" > label:<name:"job" value:"service-name" > label:<name:"k8s_deployment_name" value:"service-name" > label:<name:"k8s_namespace_name" value:"service-name" > label:<name:"k8s_node_name" value:"ip-10-11-15-117.eu-central-1.compute.internal" > label:<name:"k8s_pod_ip" value:"10.11.11.123" > label:<name:"k8s_pod_name" value:"service-name-5b8bfdc684-29xd5" > label:<name:"k8s_pod_start_time" value:"2023-05-30 12:45:54 +0000 UTC" > label:<name:"k8s_pod_uid" value:"bcc28b31-7026-47d0-bf89-e996e8b138a2" > label:<name:"label_team" value:"team-name" > label:<name:"process_runtime_name" value:"OpenJDK Runtime Environment" > label:<name:"process_runtime_version" value:"17.0.6+10" > label:<name:"service_name" value:"service-name" > label:<name:"state" value:"terminated" > gauge:<value:0 > has help "The current number of threads having NEW state" but should have "The current number of threads"
* collected metric prometheus_http_server_duration_milliseconds label:<name:"host_arch" value:"amd64" > label:<name:"host_name" value:"74f9654b8-nkpdn" > label:<name:"http_method" value:"GET" > label:<name:"http_route" value:"/status" > label:<name:"http_scheme" value:"http" > label:<name:"http_status_code" value:"200" > label:<name:"job" value:"service-name" > label:<name:"k8s_deployment_name" value:"service-name" > label:<name:"k8s_namespace_name" value:"service-name" > label:<name:"k8s_node_name" value:"ip-10-11-20-192.eu-central-1.compute.internal" > label:<name:"k8s_pod_ip" value:"10.11.18.204" > label:<name:"k8s_pod_name" value:"service-name-74f9654b8-nkpdn" > label:<name:"k8s_pod_start_time" value:"2023-05-30 04:50:43 +0000 UTC" > label:<name:"k8s_pod_uid" value:"402efd2d-1daf-4672-b3cf-535a8f76d52a" > label:<name:"net_host_name" value:"10.11.18.204" > label:<name:"net_host_port" value:"8080" > label:<name:"net_protocol_name" value:"http" > label:<name:"net_protocol_version" value:"1.1" > label:<name:"os_description" value:"Linux 5.15.90" > label:<name:"os_type" value:"linux" > label:<name:"process_command_args" value:"[\"/opt/java/openjdk/bin/java\",\"-Xms512m\",\"-Xmx512m\",\"-Dnewrelic.environment=qa\",\"-javaagent:/open-telemetry/opentelemetry-javaagent.jar\",\"org.springframework.boot.loader.JarLauncher\"]" > label:<name:"process_executable_path" value:"/opt/java/openjdk/bin/java" > label:<name:"process_pid" value:"1" > label:<name:"process_runtime_description" value:"Eclipse Adoptium OpenJDK 64-Bit Server VM 17.0.6+10" > label:<name:"process_runtime_name" value:"OpenJDK Runtime Environment" > label:<name:"process_runtime_version" value:"17.0.6+10" > label:<name:"service_name" value:"tom" > histogram:<sample_count:1 sample_sum:701.935696 bucket:<cumulative_count:0 upper_bound:0 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:0 upper_bound:25 > bucket:<cumulative_count:0 upper_bound:50 > bucket:<cumulative_count:0 upper_bound:75 > bucket:<cumulative_count:0 upper_bound:100 > bucket:<cumulative_count:0 upper_bound:250 > bucket:<cumulative_count:0 upper_bound:500 > bucket:<cumulative_count:1 upper_bound:750 > bucket:<cumulative_count:1 upper_bound:1000 > bucket:<cumulative_count:1 upper_bound:2500 > bucket:<cumulative_count:1 upper_bound:5000 > bucket:<cumulative_count:1 upper_bound:7500 > bucket:<cumulative_count:1 upper_bound:10000 > > has help "The duration of the inbound HTTP request" but should have "measures the duration of the inbound HTTP request"
This would suggest a mis-match between the description of the metric being received and the actual spec of that metric?
but, it seems strange that this would only be for some data points in a metric and not every one... does this help to narrow down the issue
Upvotes: 2
Views: 3217
Reputation: 521
I have a "fix" for this, it feels a bit like a workaround so I won't mark the question as answered...
It was these errors that were relevant for the missing data points:
* collected metric prometheus_http_server_duration_milliseconds label:<name:"framework" value:"spring" > label:<name:"host_arch" value:"amd64" > label:<name:"host_name" value:"service-name-bf77bcc49-9dmh9" > label:<name:"http_method" value:"GET" > label:<name:"http_route" value:"/v1/vins/licenseplate/{licensePlateValue}/country/{country}/latest" > label:<name:"http_scheme" value:"http" > label:<name:"http_status_code" value:"204" > label:<name:"job" value:"service-name" > label:<name:"k8s_deployment_name" value:"service-name" > label:<name:"k8s_namespace_name" value:"service-name" > label:<name:"k8s_node_name" value:"ip-10-11-22-89.eu-central-1.compute.internal" > label:<name:"k8s_pod_ip" value:"10.11.20.233" > label:<name:"k8s_pod_name" value:"service-name-bf77bcc49-9dmh9" > label:<name:"k8s_pod_start_time" value:"2023-05-30 12:02:56 +0000 UTC" > label:<name:"k8s_pod_uid" value:"f3b6ef2e-31f5-4f4e-871a-ed34acfc247a" > label:<name:"label_team" value:"team-name" > label:<name:"net_host_name" value:"service-name-svc.service-name" > label:<name:"net_protocol_name" value:"http" > label:<name:"net_protocol_version" value:"1.1" > label:<name:"os_description" value:"Linux 5.15.90" > label:<name:"os_type" value:"linux" > label:<name:"process_command_args" value:"[\"/opt/java/openjdk/bin/java\",\"-XX:+UseSerialGC\",\"-Dopentelemetry.environment=development\",\"-javaagent:/opentelemetry/opentelemetry.jar\",\"-jar\",\"/app/service-name-fa541655.jar\"]" > label:<name:"process_executable_path" value:"/opt/java/openjdk/bin/java" > label:<name:"process_pid" value:"1" > label:<name:"process_runtime_description" value:"Eclipse Adoptium OpenJDK 64-Bit Server VM 19.0.2+7" > label:<name:"process_runtime_name" value:"OpenJDK Runtime Environment" > label:<name:"process_runtime_version" value:"19.0.2+7" > label:<name:"service_name" value:"service-name" > histogram:<sample_count:12 sample_sum:3068.996335 bucket:<cumulative_count:0 upper_bound:0 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:11 upper_bound:10 > bucket:<cumulative_count:11 upper_bound:25 > bucket:<cumulative_count:11 upper_bound:50 > bucket:<cumulative_count:11 upper_bound:75 > bucket:<cumulative_count:11 upper_bound:100 > bucket:<cumulative_count:11 upper_bound:250 > bucket:<cumulative_count:11 upper_bound:500 > bucket:<cumulative_count:11 upper_bound:750 > bucket:<cumulative_count:11 upper_bound:1000 > bucket:<cumulative_count:11 upper_bound:2500 > bucket:<cumulative_count:12 upper_bound:5000 > bucket:<cumulative_count:12 upper_bound:7500 > bucket:<cumulative_count:12 upper_bound:10000 > > has help "The duration of the inbound HTTP request" but should have "measures the duration of the inbound HTTP request"
* collected metric prometheus_jvm_threads_states label:<name:"host_arch" value:"amd64" > label:<name:"host_name" value:"5b8bfdc684-29xd5" > label:<name:"job" value:"service-name" > label:<name:"k8s_deployment_name" value:"service-name" > label:<name:"k8s_namespace_name" value:"service-name" > label:<name:"k8s_node_name" value:"ip-10-11-15-117.eu-central-1.compute.internal" > label:<name:"k8s_pod_ip" value:"10.11.11.123" > label:<name:"k8s_pod_name" value:"service-name-5b8bfdc684-29xd5" > label:<name:"k8s_pod_start_time" value:"2023-05-30 12:45:54 +0000 UTC" > label:<name:"k8s_pod_uid" value:"bcc28b31-7026-47d0-bf89-e996e8b138a2" > label:<name:"label_team" value:"team-name" > label:<name:"process_runtime_name" value:"OpenJDK Runtime Environment" > label:<name:"process_runtime_version" value:"17.0.6+10" > label:<name:"service_name" value:"service-name" > label:<name:"state" value:"terminated" > gauge:<value:0 > has help "The current number of threads having NEW state" but should have "The current number of threads"
* collected metric prometheus_http_server_duration_milliseconds label:<name:"host_arch" value:"amd64" > label:<name:"host_name" value:"74f9654b8-nkpdn" > label:<name:"http_method" value:"GET" > label:<name:"http_route" value:"/status" > label:<name:"http_scheme" value:"http" > label:<name:"http_status_code" value:"200" > label:<name:"job" value:"service-name" > label:<name:"k8s_deployment_name" value:"service-name" > label:<name:"k8s_namespace_name" value:"service-name" > label:<name:"k8s_node_name" value:"ip-10-11-20-192.eu-central-1.compute.internal" > label:<name:"k8s_pod_ip" value:"10.11.18.204" > label:<name:"k8s_pod_name" value:"service-name-74f9654b8-nkpdn" > label:<name:"k8s_pod_start_time" value:"2023-05-30 04:50:43 +0000 UTC" > label:<name:"k8s_pod_uid" value:"402efd2d-1daf-4672-b3cf-535a8f76d52a" > label:<name:"net_host_name" value:"10.11.18.204" > label:<name:"net_host_port" value:"8080" > label:<name:"net_protocol_name" value:"http" > label:<name:"net_protocol_version" value:"1.1" > label:<name:"os_description" value:"Linux 5.15.90" > label:<name:"os_type" value:"linux" > label:<name:"process_command_args" value:"[\"/opt/java/openjdk/bin/java\",\"-Xms512m\",\"-Xmx512m\",\"-Dnewrelic.environment=qa\",\"-javaagent:/open-telemetry/opentelemetry-javaagent.jar\",\"org.springframework.boot.loader.JarLauncher\"]" > label:<name:"process_executable_path" value:"/opt/java/openjdk/bin/java" > label:<name:"process_pid" value:"1" > label:<name:"process_runtime_description" value:"Eclipse Adoptium OpenJDK 64-Bit Server VM 17.0.6+10" > label:<name:"process_runtime_name" value:"OpenJDK Runtime Environment" > label:<name:"process_runtime_version" value:"17.0.6+10" > label:<name:"service_name" value:"tom" > histogram:<sample_count:1 sample_sum:701.935696 bucket:<cumulative_count:0 upper_bound:0 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:0 upper_bound:25 > bucket:<cumulative_count:0 upper_bound:50 > bucket:<cumulative_count:0 upper_bound:75 > bucket:<cumulative_count:0 upper_bound:100 > bucket:<cumulative_count:0 upper_bound:250 > bucket:<cumulative_count:0 upper_bound:500 > bucket:<cumulative_count:1 upper_bound:750 > bucket:<cumulative_count:1 upper_bound:1000 > bucket:<cumulative_count:1 upper_bound:2500 > bucket:<cumulative_count:1 upper_bound:5000 > bucket:<cumulative_count:1 upper_bound:7500 > bucket:<cumulative_count:1 upper_bound:10000 > > has help "The duration of the inbound HTTP request" but should have "measures the duration of the inbound HTTP request"
For whatever reason that I still don't understand, some of the data points were being received with descriptions that didn't match what was expected. In the metrics endpoint you can see what is expected by looking at the help text:
# HELP http_server_duration_milliseconds The duration of the inbound HTTP request
# TYPE http_server_duration_milliseconds histogram
For each different error I saw in the logs, I was able to use the transform processor to ensure that the description always matches the Help text:
transform:
error_mode: ignore
metric_statements:
- context: metric
statements:
- set(description, "The duration of the inbound HTTP request") where name == "http.server.duration"
- set(description, "The current number of threads having NEW state") where name == "jvm.threads.states"
- set(description, "The number of concurrent HTTP requests that are currently in-flight") where name == "http.server.active_requests"
- set(description, "") where name == "http.server.requests"
- set(description, "") where name == "http.server.requests.max"
- set(description, "Number of log events that were enabled by the effective log level") where name == "logback.events"
I no longer have any errors or dropped data points
Upvotes: 3