Reputation: 79
I am trying to report Cassandra 3.0 metrics to a Graphite server using metrics-graphite, as suggested here: http://www.datastax.com/dev/blog/pluggable-metrics-reporting-in-cassandra-2-0-2. When there is no load on the cluster, everything works fine and all metrics are reported properly. But once the cluster comes under load, I receive the following exception in system.log:
ERROR [metrics-graphite-reporter-1-thread-1] 2016-07-13 08:21:23,580 ScheduledReporter.java:119 - RuntimeException thrown from GraphiteReporter#report. Exception was suppressed.
java.lang.IllegalStateException: Unable to compute ceiling for max when histogram overflowed
at org.apache.cassandra.utils.EstimatedHistogram.rawMean(EstimatedHistogram.java:231) ~[apache-cassandra-3.0.7.jar:3.0.7]
at org.apache.cassandra.metrics.EstimatedHistogramReservoir$HistogramSnapshot.getMean(EstimatedHistogramReservoir.java:103) ~[apache-cassandra-3.0.7.jar:3.0.7]
at com.codahale.metrics.graphite.GraphiteReporter.reportHistogram(GraphiteReporter.java:265) ~[metrics-graphite-3.1.2.jar:3.1.2]
at com.codahale.metrics.graphite.GraphiteReporter.report(GraphiteReporter.java:179) ~[metrics-graphite-3.1.2.jar:3.1.2]
at com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:162) ~[metrics-core-3.1.0.jar:3.1.0]
at com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:117) ~[metrics-core-3.1.0.jar:3.1.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_91]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_91]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_91]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_91]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_91]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_91]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
This message is repeated every time the reporter tries to collect metrics, on every Cassandra node, and some metrics become unavailable. To receive the metrics again, I have to restart all Cassandra nodes, which is very impractical. I tried several metrics-graphite versions, from 3.1.0 to 3.1.2, all with the same issue.
Upvotes: 1
Views: 1379
Reputation: 11
Here is a workaround that suppresses this error, provided you can live without reporting Table and keyspace metrics to Graphite.
We are using DataStax Enterprise 5.0.1, which contains Cassandra 3.0.7.1159. I encountered this error in a brand new install (not an upgrade), using both metrics-graphite-2.2.0.jar and metrics-graphite-3.1.2.jar, so I don't think the error depends on the version of the Coda Hale/Yammer GraphiteReporter plug-in.
Researching the related CASSANDRA Jira tickets, it seems this error is caused by Cassandra 3.0 metric values growing larger than the GraphiteReporter can handle, so the underlying EstimatedHistogram overflows.
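To illustrate the failure mode, here is a toy Python sketch (not Cassandra's actual implementation) of a bucketed histogram like EstimatedHistogram: once a value lands beyond the largest bucket boundary, there is no upper edge to use as a ceiling, so computing the mean or max has to raise, which is exactly what GraphiteReporter's getMean() call trips over:

```python
# Toy boundaries; Cassandra's real offsets grow ~20% per bucket.
BUCKET_OFFSETS = [1, 2, 3, 4, 5, 7, 8, 10]

class ToyEstimatedHistogram:
    def __init__(self):
        # One extra bucket collects values beyond the last boundary.
        self.buckets = [0] * (len(BUCKET_OFFSETS) + 1)

    def add(self, value):
        for i, bound in enumerate(BUCKET_OFFSETS):
            if value <= bound:
                self.buckets[i] += 1
                return
        self.buckets[-1] += 1  # overflow bucket

    def is_overflowed(self):
        return self.buckets[-1] > 0

    def mean(self):
        # A value in the overflow bucket has no known upper edge,
        # so the mean (and max) cannot be bounded.
        if self.is_overflowed():
            raise ValueError(
                "Unable to compute ceiling for max when histogram overflowed")
        total = sum(self.buckets[:-1])
        weighted = sum(c * b for c, b in zip(self.buckets, BUCKET_OFFSETS))
        return weighted / total if total else 0

h = ToyEstimatedHistogram()
h.add(5)
h.add(100)   # larger than any boundary -> overflow bucket
# h.mean() now raises, and in Cassandra the reporter keeps hitting
# this on every scheduled run until the node is restarted.
```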
In my metrics-reporter-config.yaml, I was using a white-list wildcard pattern, so all metrics were reported to Graphite, like this:
graphite:
  -
    period: 60
    timeunit: 'SECONDS'
    prefix: 'dev.servers'
    hosts:
      - host: 'cassandra-1'
        port: 2003
    predicate:
      color: "white"
      useQualifiedName: false
      patterns:
        - ".*"
The workaround we discovered, by process of elimination, is to switch to a specific black list, as shown below, so that the Table and keyspace metrics are never reported; with that in place, the error goes away:
graphite:
  -
    period: 60
    timeunit: 'SECONDS'
    prefix: 'dev.servers'
    hosts:
      - host: 'cassandra-1'
        port: 2003
    predicate:
      color: "black"
      useQualifiedName: true
      patterns:
        - "^org.apache.cassandra.metrics.Table.+"
        - "^org.apache.cassandra.metrics.keyspace.+"
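To sanity-check which metric names the black-list patterns will exclude, you can test them against a few qualified names with a short script. The metric names below are illustrative examples, not an exhaustive list:

```python
import re

# The black-list patterns from the predicate above.
patterns = [
    r"^org.apache.cassandra.metrics.Table.+",
    r"^org.apache.cassandra.metrics.keyspace.+",
]

def excluded(name):
    """Return True if any black-list pattern matches the metric name."""
    return any(re.match(p, name) for p in patterns)

# Example qualified metric names (hypothetical samples).
for name in [
    "org.apache.cassandra.metrics.Table.ReadLatency.system.local",
    "org.apache.cassandra.metrics.keyspace.WriteLatency.system",
    "org.apache.cassandra.metrics.ClientRequest.Read.Latency",
]:
    print(name, "->", "excluded" if excluded(name) else "reported")
```

Note that useQualifiedName must be true for the patterns to be matched against the full `org.apache.cassandra.metrics.` prefix.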
I had to restart Cassandra after making this change. After the restart, the error message no longer appeared in the Cassandra system.log file, and the black-listed metric groups that caused it were no longer reported.
Upvotes: 1