Guy

Reputation: 11

Solr 4.7.2 not recovering - "ClusterState says we are the leader, but locally we don't think so"

One morning my Solr server broke with the message below. It didn't recover on its own; I had to restart it. Is this a known issue in 4.7.2?

My topology is very simple: a single Solr node with a single shard and a single replica, and an embedded ZooKeeper (-zkrun).
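
For reference, the node is started roughly like this (paths are the stock 4.x example layout and the bootstrap parameters are omitted, so treat it as a sketch rather than my exact command line):

    cd solr-4.7.2/example
    # -DzkRun starts the embedded ZooKeeper inside the Solr JVM
    java -DzkRun -jar start.jar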

Could it be related to this 4.8 fix? SOLR-5799: When registering as the leader, if an existing ephemeral registration exists, wait a short time to see if it goes away. (Mark Miller)

ERROR - 2015-03-18 04:48:15.326; org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState says we are the leader, but locally we don't think so
INFO  - 2015-03-18 04:48:15.327; org.apache.solr.update.processor.LogUpdateProcessor; [quick-results-collection] webapp=/solr path=/update params={wt=javabin&version=2} {} 0 1
ERROR - 2015-03-18 04:48:15.328; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: ClusterState says we are the leader (http://9.70.210.149:8983/solr/quick-results-collection), but locally we don't think so. Request came from null
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:503)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:267)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:550)
    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
    at org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:96)
    at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:166)
    at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator(JavaBinUpdateRequestCodec.java:136)
    at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:225)
    at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList(JavaBinUpdateRequestCodec.java:121)
    at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:190)
    at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:116)
    at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:173)
    at org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:106)
    at org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:58)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:768)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:415)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:205)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)

Upvotes: 1

Views: 3612

Answers (1)

TMBT

Reputation: 1183

According to this link:

This can be caused by several instances sharing the same state directories, meaning that there's a mismatch between what's on disk (if a second instance spins up and writes that it's a slave to the current cluster state) and what's present in zookeeper.

Maybe you have an instance of Jetty still running somewhere that you thought was shut down, but really wasn't. Or at least that's what this person discovered:

The issue was that jetty didn't really stop so we had 2 running processes, for whatever reason this was fine for reading but not for writing.
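
If you want to rule that out, a quick check on the box is something like the following (just a sketch; the grep pattern assumes the stock start.jar/Jetty setup and the default 8983 port):

    # look for more than one Solr/Jetty JVM
    ps aux | grep -v grep | grep -E 'start\.jar|jetty'

    # see which process is actually holding the Solr port
    lsof -i :8983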

It doesn't seem to be a very common error, so it's regrettably difficult to search for. From what I can glean from poking around mailing lists and the like, some people have solved the problem by increasing zkClientTimeout for the ZooKeeper client. This especially seems to help when something keeps the node unresponsive for longer than the timeout, such as a long GC pause: the ZooKeeper session expires, the ephemeral leader registration disappears, and the node no longer considers itself the leader even though the cluster state still says it is.
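
For example, assuming your solr.xml still uses the stock placeholder (${zkClientTimeout:15000}), you can raise the timeout at startup without editing the file; the 30 second value below is just an illustration:

    # raise the Solr node's ZooKeeper session timeout from 15s to 30s
    java -DzkRun -DzkClientTimeout=30000 -jar start.jar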

Upvotes: 2
