Reputation: 11
We have two data centers (DC1 in Europe, DC2 in North America) in a DataStax Enterprise Solr cluster (version 4.5) on CentOS:
DC1: 2 nodes with RF set to 2
DC2: 1 node with RF set to 1
Every node has 2 cores and 4 GB of RAM. We created only one keyspace; the 2 nodes in DC1 have about 400 MB of data each, while the node in DC2 is empty.
If I start a nodetool repair on the node in DC2, the command runs fine for about 20-30 minutes and then gets stuck.
In the logs of the node in DC2 I can read this:
WARN [NonPeriodicTasks:1] 2014-10-01 05:57:44,188 WorkPool.java (line 398) Timeout while waiting for workers when flushing pool {}. IndexCurrent timeout is Failure to flush may cause excessive growth of Cassandra commit log. millis, consider increasing it, or reducing load on the node.
ERROR [NonPeriodicTasks:1] 2014-10-01 05:57:44,190 CassandraDaemon.java (line 199) Exception in thread Thread[NonPeriodicTasks:1,5,main]
org.apache.solr.common.SolrException: java.lang.RuntimeException: Timeout while waiting for workers when flushing pool {}. IndexCurrent timeout is Failure to flush may cause excessive growth of Cassandra commit log. millis, consider increasing it, or reducing load on the node.
at com.datastax.bdp.search.solr.handler.update.CassandraDirectUpdateHandler.commit(CassandraDirectUpdateHandler.java:351)
at com.datastax.bdp.search.solr.AbstractSolrSecondaryIndex.doCommit(AbstractSolrSecondaryIndex.java:994)
at com.datastax.bdp.search.solr.AbstractSolrSecondaryIndex.forceBlockingFlush(AbstractSolrSecondaryIndex.java:139)
at org.apache.cassandra.db.index.SecondaryIndexManager.flushIndexesBlocking(SecondaryIndexManager.java:338)
at org.apache.cassandra.db.index.SecondaryIndexManager.maybeBuildSecondaryIndexes(SecondaryIndexManager.java:144)
at org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:113)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.RuntimeException: Timeout while waiting for workers when flushing pool {}. IndexCurrent timeout is Failure to flush may cause excessive growth of Cassandra commit log. millis, consider increasing it, or reducing load on the node.
at com.datastax.bdp.concurrent.WorkPool.doFlush(WorkPool.java:399)
at com.datastax.bdp.concurrent.WorkPool.flush(WorkPool.java:339)
at com.datastax.bdp.search.solr.AbstractSolrSecondaryIndex.flushIndexUpdates(AbstractSolrSecondaryIndex.java:484)
at com.datastax.bdp.search.solr.handler.update.CassandraDirectUpdateHandler.commit(CassandraDirectUpdateHandler.java:278)
... 12 more
WARN [commitScheduler-3-thread-1] 2014-10-01 05:58:47,351 WorkPool.java (line 398) Timeout while waiting for workers when flushing pool {}. IndexCurrent timeout is Failure to flush may cause excessive growth of Cassandra commit log. millis, consider increasing it, or reducing load on the node.
ERROR [commitScheduler-3-thread-1] 2014-10-01 05:58:47,352 SolrException.java (line 136) auto commit error...:org.apache.solr.common.SolrException: java.lang.RuntimeException: Timeout while waiting for workers when flushing pool {}. IndexCurrent timeout is Failure to flush may cause excessive growth of Cassandra commit log. millis, consider increasing it, or reducing load on the node.
at com.datastax.bdp.search.solr.handler.update.CassandraDirectUpdateHandler.commit(CassandraDirectUpdateHandler.java:351)
at org.apache.solr.update.CommitTracker.run(CommitTracker.java:216)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.RuntimeException: Timeout while waiting for workers when flushing pool {}. IndexCurrent timeout is Failure to flush may cause excessive growth of Cassandra commit log. millis, consider increasing it, or reducing load on the node.
at com.datastax.bdp.concurrent.WorkPool.doFlush(WorkPool.java:399)
at com.datastax.bdp.concurrent.WorkPool.flush(WorkPool.java:339)
at com.datastax.bdp.search.solr.AbstractSolrSecondaryIndex.flushIndexUpdates(AbstractSolrSecondaryIndex.java:484)
at com.datastax.bdp.search.solr.handler.update.CassandraDirectUpdateHandler.commit(CassandraDirectUpdateHandler.java:278)
... 8 more
I tried increasing some timeouts in the cassandra.yaml file, without luck. Thanks.
Upvotes: 1
Views: 1197
Reputation: 7305
One way to reduce contention from Solr indexing is to increase the autoSoftCommit maxTime in your solrconfig.xml:
<autoSoftCommit>
  <maxTime>1000000</maxTime>
</autoSoftCommit>
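In DSE Search the solrconfig.xml belongs to the core, so after editing it you generally have to re-upload it and reload the core before the new maxTime takes effect. A rough sketch of that sequence, assuming DSE 4.x defaults and using localhost, port 8983, and mykeyspace.mytable as placeholders for your own values:
# Re-upload the edited solrconfig.xml as the core's config resource
curl -H 'Content-Type: text/xml' \
     --data-binary @solrconfig.xml \
     "http://localhost:8983/solr/resource/mykeyspace.mytable/solrconfig.xml"

# Reload the core so the new autoSoftCommit setting is picked up
curl "http://localhost:8983/solr/admin/cores?action=RELOAD&name=mykeyspace.mytable"
Note that 1000000 ms means a soft commit at most roughly every 17 minutes, which trades index freshness for less commit pressure during the repair.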
Upvotes: 0
Reputation: 575
Two items that might be helpful:
1. The RuntimeException you are seeing in the log is along the Lucene code path that commits index changes to disk, so I would certainly determine whether or not writing to disk is your bottleneck. (Are you using different physical disks for your data and commit log?)
2. The parameter you probably want to tweak in the meantime is the one that controls WorkPool flush timeouts in dse.yaml; it is called flush_max_time_per_core (see the sketch below).
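For reference, a change like this would go into dse.yaml on the affected node and requires a restart of DSE. The value below is only an illustration, not a recommendation; I believe the unit is minutes in DSE 4.x, so check the comments in your own dse.yaml before changing it:
# dse.yaml (illustrative value only)
# Maximum time to wait for asynchronous index updates to flush before timing out
flush_max_time_per_core: 10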
Upvotes: 1
Reputation: 461
Your nodes are fairly under-specified for a DSE Solr installation.
I would normally recommend at least 8 cores and at least 64 GB of memory, allocating 12-14 GB of it to the heap (see the cassandra-env.sh sketch at the end of this answer).
The following troubleshooting guide is pretty good:
Your current data load is small, so you probably don't need the full whack of memory - I'd guess the bottleneck here is the CPUs.
If you're not running 4.0.4 or 4.5.2, I'd upgrade to one of those versions.
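For the heap sizing mentioned above, the usual place is conf/cassandra-env.sh. A sketch of what that could look like on a machine with the recommended amount of RAM (illustrative values only, definitely not something to apply on the current 4 GB nodes):
# conf/cassandra-env.sh -- only on a node with enough physical RAM (e.g. 64 GB)
MAX_HEAP_SIZE="14G"    # per the 12-14 GB recommendation above
HEAP_NEWSIZE="800M"    # roughly 100 MB per physical core is a common starting point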
Upvotes: 1