jagadeesh k

Reputation: 119

org.infinispan.util.concurrent.TimeoutException: Replication timeout for "node-name"

We have three services that must run in a cluster, so we use Infinispan to cluster the nodes and share data between the services. Sometimes, after a successful restart, one node throws the exception below, and another node receives a "View Changed" event reporting that the first node left the cluster. In reality, all nodes were still running, and I cannot figure out the cause.

I am using Infinispan 8.1.3 (distributed cache) with JGroups 3.4.

org.infinispan.util.concurrent.TimeoutException: Replication timeout for sipproxy-16964
            at org.infinispan.remoting.transport.jgroups.JGroupsTransport.checkRsp(JGroupsTransport.java:765)
            at org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$invokeRemotelyAsync$80(JGroupsTransport.java:599)
            at org.infinispan.remoting.transport.jgroups.JGroupsTransport$$Lambda$9/1547262581.apply(Unknown Source)
            at java.util.concurrent.CompletableFuture$ThenApply.run(CompletableFuture.java:717)
            at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:193)
            at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2345)
            at org.infinispan.remoting.transport.jgroups.SingleResponseFuture.call(SingleResponseFuture.java:46)
            at org.infinispan.remoting.transport.jgroups.SingleResponseFuture.call(SingleResponseFuture.java:17)
            at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
            at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    2017-08-22 04:44:52,902 INFO  [JGroupsTransport] (ViewHandler,ISPN,transport_manager-48870) ISPN000094: Received new cluster view for channel ISPN: [transport_manager-48870|3] (2) [transport_manager-48870, mediaproxy-47178]
    2017-08-22 04:44:52,949 WARN  [PreferAvailabilityStrategy] (transport-thread-transport_manager-p4-t24) ISPN000313: Cache mediaProxyResponseCache lost data because of abrupt leavers [sipproxy-16964]
    2017-08-22 04:44:52,951 WARN  [ClusterTopologyManagerImpl] (transport-thread-transport_manager-p4-t24) ISPN000197: Error updating cluster member list
    java.lang.IllegalArgumentException: There must be at least one node with a non-zero capacity factor
            at org.infinispan.distribution.ch.impl.DefaultConsistentHashFactory.checkCapacityFactors(DefaultConsistentHashFactory.java:57)
            at org.infinispan.distribution.ch.impl.DefaultConsistentHashFactory.updateMembers(DefaultConsistentHashFactory.java:74)
            at org.infinispan.distribution.ch.impl.DefaultConsistentHashFactory.updateMembers(DefaultConsistentHashFactory.java:26)
            at org.infinispan.topology.ClusterCacheStatus.updateCurrentTopology(ClusterCacheStatus.java:431)
            at org.infinispan.partitionhandling.impl.PreferAvailabilityStrategy.onClusterViewChange(PreferAvailabilityStrategy.java:56)
            at org.infinispan.topology.ClusterCacheStatus.doHandleClusterView(ClusterCacheStatus.java:337)
            at org.infinispan.topology.ClusterTopologyManagerImpl.updateCacheMembers(ClusterTopologyManagerImpl.java:397)
            at org.infinispan.topology.ClusterTopologyManagerImpl.handleClusterView(ClusterTopologyManagerImpl.java:314)
            at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener$1.run(ClusterTopologyManagerImpl.java:571)
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)

jgroups.xml:

<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.4.xsd">
    <TCP bind_addr="131.10.20.16"
         bind_port="8010" port_range="10"
         recv_buf_size="20000000"
         send_buf_size="640000"
         loopback="false"
         max_bundle_size="64k"
         bundler_type="old"
         enable_diagnostics="true"
         thread_naming_pattern="cl"
         timer_type="new"
         timer.min_threads="4"
         timer.max_threads="30"
         timer.keep_alive_time="3000"
         timer.queue_max_size="100"
         timer.wheel_size="200"
         timer.tick_time="50"
         thread_pool.enabled="true"
         thread_pool.min_threads="2"
         thread_pool.max_threads="30"
         thread_pool.keep_alive_time="5000"
         thread_pool.queue_enabled="true"
         thread_pool.queue_max_size="100"
         thread_pool.rejection_policy="discard"

         oob_thread_pool.enabled="true"
         oob_thread_pool.min_threads="2"
         oob_thread_pool.max_threads="30"
         oob_thread_pool.keep_alive_time="5000"
         oob_thread_pool.queue_enabled="false"
         oob_thread_pool.queue_max_size="100"
         oob_thread_pool.rejection_policy="discard"/>
        <TCPPING initial_hosts="131.10.20.16[8010],131.10.20.17[8010],131.10.20.182[8010]" port_range="2"
         timeout="3000" num_initial_members="3" />

    <MERGE3 max_interval="30000"
            min_interval="10000"/>

    <FD_SOCK/>
    <FD_ALL interval="3000" timeout="10000" />
    <VERIFY_SUSPECT timeout="500"  />
    <BARRIER />
    <pbcast.NAKACK use_mcast_xmit="false"
                   retransmit_timeout="100,300,600,1200"
                   discard_delivered_msgs="true" />
    <UNICAST3 conn_expiry_timeout="0"/>

    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                   max_bytes="10m"/>
    <pbcast.GMS print_local_addr="true" join_timeout="5000"
                max_bundling_time="30"
                view_bundling="true"/>
    <UFC max_credits="2M"
         min_threshold="0.4"/>
    <MFC max_credits="2M"
         min_threshold="0.4"/>
    <FRAG2 frag_size="60000"  />
    <pbcast.STATE_TRANSFER/>
</config>

Upvotes: 2

Views: 5652

Answers (2)

Sebastian Łaskawiec

Reputation: 2737

As @flavius suggested, the most likely cause is that one of your nodes stalled for some reason and failed to reply to an RPC in time.

I suggest raising the JGroups logging level so you can see why the node was suspected (either the FD_SOCK or the FD_ALL protocol can suspect it) and why it was removed from the view (very likely by the VERIFY_SUSPECT protocol).
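As a sketch, assuming your deployment uses a Log4j 1.x properties file (adjust the syntax to whatever logging backend you actually use), the failure-detection protocols can be turned up like this; the category names follow the JGroups protocol class names:

```properties
# hypothetical log4j.properties fragment - raise only the
# failure-detection protocols, not all of org.jgroups
log4j.logger.org.jgroups.protocols.FD_ALL=TRACE
log4j.logger.org.jgroups.protocols.FD_SOCK=TRACE
log4j.logger.org.jgroups.protocols.VERIFY_SUSPECT=TRACE
```

With this in place, the log will show which protocol suspected the node and when, which narrows down whether the problem is missed heartbeats (FD_ALL) or a broken socket (FD_SOCK).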

You should also check why the node stalled. In most cases this is caused by long GC pauses, but the VM can also be paused by the host machine for other reasons. I suggest running jHiccup on both VMs, attached as a Java agent to your process. That way you can tell whether the pause was caused by a JVM stop-the-world event or by the OS.
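For reference, jHiccup is attached via the standard `-javaagent` JVM option; the paths below are placeholders for your deployment:

```shell
# hypothetical paths; jHiccup writes a pause-time histogram log
# next to the process, which you can correlate with the suspect events
java -javaagent:/opt/jHiccup/jHiccup.jar \
     -jar your-service.jar
```

If the jHiccup log shows multi-second hiccups at the times the node was suspected, the pause (GC or OS-level) is your root cause, not the network.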

Upvotes: 3

Radim Vansa

Reputation: 5888

A TimeoutException says only that a response to an RPC was not received within the timeout, nothing more. That can happen when the server is under stress, but that's probably not the case here: the subsequent logs show that the node was 'suspected', meaning it was probably unresponsive for more than 10 seconds (the limit in your configuration; see FD_ALL).
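In the jgroups.xml above, the relevant line is `<FD_ALL interval="3000" timeout="10000" />`: heartbeats are sent every 3 seconds, and a node is suspected after 10 seconds of silence. If the pauses turn out to be legitimate (e.g. long GCs you cannot eliminate), one mitigation is to widen that window; the value below is illustrative, not a recommendation:

```xml
<!-- suspect a node only after ~30 s without heartbeats -->
<FD_ALL interval="3000" timeout="30000" />
```

Note this only hides the symptom: the cluster will take longer to detect nodes that have genuinely died, so fixing the pause itself is preferable.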

First, check that server's log for errors, and check its GC logs for any stop-the-world pauses.
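If GC logging is not already enabled, on Java 8 (which the stack trace indicates) it can be turned on with JVM flags along these lines; the log path is a placeholder:

```
-Xloggc:/var/log/myapp/gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime
```

`PrintGCApplicationStoppedTime` is the most useful one here: it records every stop-the-world pause, not just GC, so a pause longer than the 10-second FD_ALL timeout will show up directly.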

Upvotes: 3
