Zachary McArthur

Reputation: 85

Grafana negative spikes in latency query

I have a Grafana dashboard that is measuring latency of a Kafka topic per partition in minutes using this query here:

avg by (topic, consumergroup, environment, partition)(kafka_consumer_lag_millis{environment="production",topic="topic.name",consumergroup="consumer.group.name"}) / 1000 / 60

The graph is working fine, but we're seeing negative spikes in it that don't make a lot of sense to us. Does anyone know what could be causing these spikes?

Upvotes: 0

Views: 932

Answers (1)

gabriel119435

Reputation: 6842

This is more of a guess than an accurate answer. Let's suppose, in a very simplified way, that two metrics are being measured and their difference is the number sent to Prometheus:

lag = producer_offset - consumer_offset

While the producer offset is measured with a polling mechanism, the consumer offset is measured with direct synchronous requests (to wherever these values are kept internally). This way, we could have outdated values for the producer. Example:

instant  | producer | consumer
t1       | 10       | 0
t2       | 30       | 15
t3       | 200      | 70

If the values were always up to date, we would have:

instant | lag
t1      | 10 - 0   = 10
t2      | 30 - 15  = 15
t3      | 200 - 70 = 130

Now let's suppose the producer offset is one measurement behind from t2 onward because of a long polling period:

l(t1) = p(t1) - c(t1)
l(t2) = p(t1) - c(t2)
l(t3) = p(t2) - c(t3)

This would produce:

instant | lag
t1      | 10 - 0  = 10
t2      | 10 - 15 = -5
t3      | 30 - 70 = -40

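And there's your negative value: when the offsets grow quickly and the polling interval for the positive term (the producer offset) is longer than Prometheus' scrape interval, the stale producer offset can fall below the current consumer offset, so the computed lag goes negative, and its magnitude can even exceed the earlier positive values.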
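To make the arithmetic easy to replay, here is a minimal Python sketch of the same guess. It is not taken from any real exporter: `lag_series` and its `stale_by` parameter are hypothetical names made up for this illustration, and the offsets are the ones from the tables above.

```python
# Offsets from the example tables: values at t1, t2, t3.
producer = [10, 30, 200]
consumer = [0, 15, 70]


def lag_series(producer, consumer, stale_by=0):
    """Compute lag per instant, reading the producer offset `stale_by` samples late.

    Hypothetical helper for illustration only. When no older sample exists yet
    (the first instants), the oldest available producer value is used.
    """
    lags = []
    for i, consumer_offset in enumerate(consumer):
        producer_offset = producer[max(i - stale_by, 0)]
        lags.append(producer_offset - consumer_offset)
    return lags


fresh = lag_series(producer, consumer)              # producer always up to date
stale = lag_series(producer, consumer, stale_by=1)  # producer one sample behind

print(fresh)  # [10, 15, 130]
print(stale)  # [10, -5, -40]
```

With `stale_by=0` the lag stays positive; with `stale_by=1` it reproduces the negative values from the second table, which is the kind of dip that would show up as a negative spike on the graph if this guess is right.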

Now, to really answer your question, we would need to look at Prometheus' Kafka client code and check whether the polling interval is configurable, so that it can be made shorter until the negative values vanish (or simply set shorter than the Prometheus scrape interval directly).

Upvotes: 2
