Reputation: 1397
I set up a cluster of two nodes running my Spring Boot app. There are no external resources - only an in-memory DB and Apache Ignite as a distributed cache. I captured TCP dumps and JFR recordings on both nodes while one node was already running and the second one was trying to start.
This is what I see on the node that was trying to start:
Ten seconds to read a single byte. At the same time, another Ignite SPI thread on this node, "tcp-disco-msg-worker", was reading from the same remote IP and port, and that read also took about 10 seconds.
Meanwhile, the first node, which was already running, was doing the same thing - reading one byte for about ten seconds.
So it seems they were trying to read messages from each other at the same time and blocked each other - while an SPI thread is blocked in a read, it cannot write its answer.
But I don't see this behaviour on another cluster of two servers with exactly the same app and configuration - no long reads at all.
Does anyone have an idea what is wrong with my first cluster, and why Ignite tries to read topology meta-information on both nodes at the same time?
The main problem is that Ignite frequently fails on one of the nodes of my first cluster because the SPI threads decide the cluster is in a SEGMENTED state. The SPI thread reaches this conclusion because it can't get meta-information from one of the nodes during regular communication.
It looks like there is some network problem on this cluster, but all low-level tests show that everything is OK.
I know the usual explanations for the SEGMENTED state - long GC pauses, network problems and so on - but I don't see anything like that, just very long reads during socket communication in the SPI thread, and only on one cluster. Another cluster with the same configuration doesn't have these problems.
Could anyone advise me on what I should check or test?
UPD: My config:
@Bean
public IgniteCache<String, WaveFile> getCache() {
    log.info("Ignite addresses: " + Arrays.asList(igniteAddresses));

    // Node configuration
    IgniteConfiguration conf = new IgniteConfiguration();
    Slf4jLogger gridLog = new Slf4jLogger(log);
    conf.setGridLogger(gridLog);
    conf.setClientMode(false);
    conf.setNetworkTimeout(networkTimeout);
    conf.setClassLoader(WaveFile.class.getClassLoader());

    // Discovery: static IP finder over the configured addresses
    TcpDiscoverySpi tcpDiscoverySpi = new TcpDiscoverySpi();
    TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
    ipFinder.setAddresses(Arrays.asList(igniteAddresses));
    tcpDiscoverySpi.setIpFinder(ipFinder);
    tcpDiscoverySpi.setAckTimeout(failureDetectionTimeout);
    conf.setDiscoverySpi(tcpDiscoverySpi);
    conf.setSqlConnectorConfiguration(null);

    // Memory policy (5 GB initial size)
    MemoryConfiguration memoryConfiguration = new MemoryConfiguration();
    MemoryPolicyConfiguration memoryPolicyConfiguration = new MemoryPolicyConfiguration();
    memoryPolicyConfiguration.setInitialSize(5L * 1024 * 1024 * 1024);
    memoryConfiguration.setMemoryPolicies(memoryPolicyConfiguration);
    conf.setMemoryConfiguration(memoryConfiguration);

    Ignite ignite = Ignition.start(conf);

    // Cache config
    CacheConfiguration<String, WaveFile> cacheConfig = new CacheConfiguration<>();
    cacheConfig.setName(cacheName);
    cacheConfig.setCacheMode(CacheMode.PARTITIONED);
    cacheConfig.setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_SYNC);
    cacheConfig.setCopyOnRead(false);
    cacheConfig.setEagerTtl(eagerTtl);
    cacheConfig.setExpiryPolicyFactory(FactoryBuilder.factoryOf(
            new TouchedExpiryPolicy(new Duration(TimeUnit.SECONDS, expirationPeriod))));

    return ignite.getOrCreateCache(cacheConfig);
}
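For reference, here is a minimal sketch of how the discovery timeouts could be raised (the "Node has not been connected to topology" warning in the logs below suggests increasing TcpDiscoverySpi.networkTimeout). The values are placeholders; I have not deployed this:

TcpDiscoverySpi spi = new TcpDiscoverySpi();
spi.setIpFinder(ipFinder);
// Default is 5000 ms - the value mentioned in the warning below.
spi.setNetworkTimeout(15_000);
// Per-operation socket and ack timeouts (placeholder values).
spi.setSocketTimeout(10_000);
spi.setAckTimeout(10_000);
conf.setDiscoverySpi(spi);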
Logs. You can see 4 nodes there, but I actually ran my tests on only two of them.
2017-10-27 14:01:53.084 [main] ERROR - Failed to resolve default logging config file: config/java.util.logging.properties
2017-10-27 14:02:35.175 [main] INFO com.voice.Application - Ignite addresses: [10.200.43.28, 10.200.43.32, 10.200.43.33, 10.200.43.34]
2017-10-27 14:02:35.369 [main] INFO o.a.ignite.internal.IgniteKernal -
>>> __________ ________________
>>> / _/ ___/ |/ / _/_ __/ __/
>>> _/ // (7 7 // / / / / _/
>>> /___/\___/_/|_/___/ /_/ /___/
>>>
>>> ver. 2.1.0#20170720-sha1:a6ca5c8a
>>> 2017 Copyright(C) Apache Software Foundation
>>>
>>> Ignite documentation: http://ignite.apache.org
[14:02:35] __________ ________________
[14:02:35] / _/ ___/ |/ / _/_ __/ __/
[14:02:35] _/ // (7 7 // / / / / _/
[14:02:35] /___/\___/_/|_/___/ /_/ /___/
[14:02:35]
[14:02:35] ver. 2.1.0#20170720-sha1:a6ca5c8a
[14:02:35] 2017 Copyright(C) Apache Software Foundation
[14:02:35]
[14:02:35] Ignite documentation: http://ignite.apache.org
[14:02:35]
[14:02:35] Quiet mode.
[14:02:35] ^-- To see **FULL** console log here add -DIGNITE_QUIET=false or "-v" to ignite.{sh|bat}
[14:02:35]
2017-10-27 14:02:35.370 [main] INFO o.a.ignite.internal.IgniteKernal - Config URL: n/a
2017-10-27 14:02:35.370 [main] INFO o.a.ignite.internal.IgniteKernal - Daemon mode: off
[14:02:35] OS: Linux 3.10.0-327.el7.x86_64 amd64
2017-10-27 14:02:35.370 [main] INFO o.a.ignite.internal.IgniteKernal - OS: Linux 3.10.0-327.el7.x86_64 amd64
2017-10-27 14:02:35.371 [main] INFO o.a.ignite.internal.IgniteKernal - OS user: mopopov
2017-10-27 14:02:35.371 [main] INFO o.a.ignite.internal.IgniteKernal - PID: 31516
[14:02:35] VM information: Java(TM) SE Runtime Environment 1.8.0_121-b13 Oracle Corporation Java HotSpot(TM) 64-Bit Server VM 25.121-b13
2017-10-27 14:02:35.372 [main] INFO o.a.ignite.internal.IgniteKernal - Language runtime: Java Platform API Specification ver. 1.8
2017-10-27 14:02:35.372 [main] INFO o.a.ignite.internal.IgniteKernal - VM information: Java(TM) SE Runtime Environment 1.8.0_121-b13 Oracle Corporation Java HotSpot(TM) 64-Bit Server VM 25.121-b13
2017-10-27 14:02:35.373 [main] INFO o.a.ignite.internal.IgniteKernal - VM total memory: 10.0GB
2017-10-27 14:02:35.374 [main] INFO o.a.ignite.internal.IgniteKernal - Remote Management [restart: off, REST: on, JMX (remote: off)]
2017-10-27 14:02:35.374 [main] INFO o.a.ignite.internal.IgniteKernal - IGNITE_HOME=null
2017-10-27 14:02:35.374 [main] INFO o.a.ignite.internal.IgniteKernal - VM arguments: [-Xmx12000m, -XX:+UnlockCommercialFeatures, -XX:+HeapDumpOnOutOfMemoryError, -Dspring.profiles.active=prod]
2017-10-27 14:02:35.374 [main] INFO o.a.ignite.internal.IgniteKernal - System cache's MemoryPolicy size is configured to 40 MB. Use MemoryConfiguration.systemCacheMemorySize property to change the setting.
2017-10-27 14:02:35.388 [main] INFO o.a.ignite.internal.IgniteKernal - Configured caches [in 'sysMemPlc' memoryPolicy: ['ignite-sys-cache']]
2017-10-27 14:02:35.395 [pub-#14%null%] WARN o.a.ignite.internal.GridDiagnostic - Initial heap size is 250MB (should be no less than 512MB, use -Xms512m -Xmx512m).
[14:02:35] Initial heap size is 250MB (should be no less than 512MB, use -Xms512m -Xmx512m).
[14:02:35] Configured plugins:
2017-10-27 14:02:35.477 [main] INFO o.a.i.i.p.p.IgnitePluginProcessor - Configured plugins:
[14:02:35] ^-- None
2017-10-27 14:02:35.478 [main] INFO o.a.i.i.p.p.IgnitePluginProcessor - ^-- None
[14:02:35]
2017-10-27 14:02:35.478 [main] INFO o.a.i.i.p.p.IgnitePluginProcessor -
2017-10-27 14:02:35.555 [main] INFO o.a.i.s.c.tcp.TcpCommunicationSpi - Successfully bound communication NIO server to TCP port [port=47100, locHost=0.0.0.0/0.0.0.0, selectorsCnt=4, selectorSpins=0, pairedConn=false]
2017-10-27 14:02:35.560 [main] WARN o.a.i.s.c.tcp.TcpCommunicationSpi - Message queue limit is set to 0 which may lead to potential OOMEs when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due to message queues growth on sender and receiver sides.
[14:02:35] Message queue limit is set to 0 which may lead to potential OOMEs when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due to message queues growth on sender and receiver sides.
2017-10-27 14:02:35.586 [main] WARN o.a.i.s.c.noop.NoopCheckpointSpi - Checkpoints are disabled (to enable configure any GridCheckpointSpi implementation)
2017-10-27 14:02:35.627 [main] WARN o.a.i.i.m.c.GridCollisionManager - Collision resolution is disabled (all jobs will be activated upon arrival).
[14:02:35] Security status [authentication=off, tls/ssl=off]
2017-10-27 14:02:35.628 [main] INFO o.a.ignite.internal.IgniteKernal - Security status [authentication=off, tls/ssl=off]
2017-10-27 14:02:36.072 [main] INFO o.a.i.i.p.r.p.t.GridTcpRestProtocol - Command protocol successfully started [name=TCP binary, host=0.0.0.0/0.0.0.0, port=11211]
2017-10-27 14:02:36.133 [main] INFO o.a.ignite.internal.IgniteKernal - Non-loopback local IPs: 10.200.43.34, fe80:0:0:0:250:56ff:fe93:3f60%eno16777984
2017-10-27 14:02:36.133 [main] INFO o.a.ignite.internal.IgniteKernal - Enabled local MACs: 005056933F60
2017-10-27 14:02:36.188 [main] INFO o.a.i.s.d.tcp.TcpDiscoverySpi - Successfully bound to TCP port [port=47500, localHost=0.0.0.0/0.0.0.0, locNodeId=64941c34-2912-4934-8c92-c0c089391207]
2017-10-27 14:02:51.267 [main] WARN o.a.i.s.d.tcp.TcpDiscoverySpi - Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000]
2017-10-27 14:03:46.397 [exchange-worker-#29%null%] INFO o.a.ignite.internal.exchange.time - Started exchange init [topVer=AffinityTopologyVersion [topVer=13, minorTopVer=0], crd=false, evt=10, node=TcpDiscoveryNode [id=64941c34-2912-4934-8c92-c0c089391207, addrs=[0:0:0:0:0:0:0:1%lo, 10.200.43.34, 127.0.0.1], sockAddrs=[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, 4212-bio-lanta-01/10.200.43.34:47500], discPort=47500, order=13, intOrder=8, lastExchangeTime=1509102156183, loc=true, ver=2.1.0#20170720-sha1:a6ca5c8a, isClient=false], evtNode=TcpDiscoveryNode [id=64941c34-2912-4934-8c92-c0c089391207, addrs=[0:0:0:0:0:0:0:1%lo, 10.200.43.34, 127.0.0.1], sockAddrs=[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, 4212-bio-lanta-01/10.200.43.34:47500], discPort=47500, order=13, intOrder=8, lastExchangeTime=1509102156183, loc=true, ver=2.1.0#20170720-sha1:a6ca5c8a, isClient=false], customEvt=null]
2017-10-27 14:03:47.022 [exchange-worker-#29%null%] INFO o.a.i.i.p.cache.GridCacheProcessor - Started cache [name=ignite-sys-cache, memoryPolicyName=sysMemPlc, mode=REPLICATED, atomicity=TRANSACTIONAL]
2017-10-27 14:03:47.062 [exchange-worker-#29%null%] INFO o.a.i.i.p.cache.GridCacheProcessor - Started cache [name=SHARED_AUDIO_CACHE, memoryPolicyName=default, mode=PARTITIONED, atomicity=ATOMIC]
2017-10-27 14:03:57.115 [grid-nio-worker-tcp-comm-0-#17%null%] INFO o.a.i.s.c.tcp.TcpCommunicationSpi - Established outgoing communication connection [locAddr=/10.200.43.34:53456, rmtAddr=/10.200.43.28:47100]
2017-10-27 14:03:57.172 [exchange-worker-#29%null%] INFO o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture - Finished waiting for partition release future [topVer=AffinityTopologyVersion [topVer=13, minorTopVer=0], waitTime=0ms]
2017-10-27 14:03:57.252 [exchange-worker-#29%null%] INFO o.a.ignite.internal.exchange.time - Finished exchange init [topVer=AffinityTopologyVersion [topVer=13, minorTopVer=0], crd=false]
2017-10-27 14:03:57.293 [sys-#32%null%] INFO o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture - Snapshot initialization completed [topVer=AffinityTopologyVersion [topVer=13, minorTopVer=0], time=0ms]
2017-10-27 14:03:57.344 [exchange-worker-#29%null%] INFO o.a.i.i.p.c.GridCachePartitionExchangeManager - Rebalancing scheduled [order=[ignite-sys-cache, SHARED_AUDIO_CACHE]]
2017-10-27 14:03:57.344 [exchange-worker-#29%null%] INFO o.a.i.i.p.c.GridCachePartitionExchangeManager - Rebalancing started [top=AffinityTopologyVersion [topVer=13, minorTopVer=0], evt=NODE_JOINED, node=64941c34-2912-4934-8c92-c0c089391207]
2017-10-27 14:03:57.345 [exchange-worker-#29%null%] INFO o.a.i.i.p.c.d.d.p.GridDhtPartitionDemander - Starting rebalancing [mode=SYNC, fromNode=110e4e93-2954-4b76-8649-a23f4a83990a, partitionsCount=100, topology=AffinityTopologyVersion [topVer=13, minorTopVer=0], updateSeq=1]
2017-10-27 14:03:57.391 [utility-#34%null%] INFO o.a.i.i.p.c.d.d.p.GridDhtPartitionDemander - Completed (final) rebalancing [fromNode=110e4e93-2954-4b76-8649-a23f4a83990a, topology=AffinityTopologyVersion [topVer=13, minorTopVer=0], time=51 ms]
2017-10-27 14:03:57.395 [utility-#34%null%] INFO o.a.i.i.p.c.d.d.p.GridDhtPartitionDemander - Starting rebalancing [mode=ASYNC, fromNode=110e4e93-2954-4b76-8649-a23f4a83990a, partitionsCount=518, topology=AffinityTopologyVersion [topVer=13, minorTopVer=0], updateSeq=1]
2017-10-27 14:03:57.449 [sys-#36%null%] INFO o.a.i.i.p.c.d.d.p.GridDhtPartitionDemander - Completed (final) rebalancing [fromNode=110e4e93-2954-4b76-8649-a23f4a83990a, topology=AffinityTopologyVersion [topVer=13, minorTopVer=0], time=51 ms]
[14:03:57] Performance suggestions for grid (fix if possible)
2017-10-27 14:03:57.463 [main] INFO o.a.ignite.internal.IgniteKernal - Performance suggestions for grid (fix if possible)
[14:03:57] To disable, set -DIGNITE_PERFORMANCE_SUGGESTIONS_DISABLED=true
2017-10-27 14:03:57.463 [main] INFO o.a.ignite.internal.IgniteKernal - To disable, set -DIGNITE_PERFORMANCE_SUGGESTIONS_DISABLED=true
[14:03:57] ^-- Enable G1 Garbage Collector (add '-XX:+UseG1GC' to JVM options)
2017-10-27 14:03:57.463 [main] INFO o.a.ignite.internal.IgniteKernal - ^-- Enable G1 Garbage Collector (add '-XX:+UseG1GC' to JVM options)
[14:03:57] ^-- Set max direct memory size if getting 'OOME: Direct buffer memory' (add '-XX:MaxDirectMemorySize=<size>[g|G|m|M|k|K]' to JVM options)
2017-10-27 14:03:57.464 [main] INFO o.a.ignite.internal.IgniteKernal - ^-- Set max direct memory size if getting 'OOME: Direct buffer memory' (add '-XX:MaxDirectMemorySize=<size>[g|G|m|M|k|K]' to JVM options)
[14:03:57] ^-- Disable processing of calls to System.gc() (add '-XX:+DisableExplicitGC' to JVM options)
2017-10-27 14:03:57.464 [main] INFO o.a.ignite.internal.IgniteKernal - ^-- Disable processing of calls to System.gc() (add '-XX:+DisableExplicitGC' to JVM options)
[14:03:57] ^-- Speed up flushing of dirty pages by OS (alter vm.dirty_expire_centisecs parameter by setting to 500)
2017-10-27 14:03:57.464 [main] INFO o.a.ignite.internal.IgniteKernal - ^-- Speed up flushing of dirty pages by OS (alter vm.dirty_expire_centisecs parameter by setting to 500)
[14:03:57] ^-- Reduce pages swapping ratio (set vm.swappiness=10)
2017-10-27 14:03:57.464 [main] INFO o.a.ignite.internal.IgniteKernal - ^-- Reduce pages swapping ratio (set vm.swappiness=10)
[14:03:57] ^-- Disable fully synchronous writes (set 'writeSynchronizationMode' to PRIMARY_SYNC or FULL_ASYNC)
2017-10-27 14:03:57.464 [main] INFO o.a.ignite.internal.IgniteKernal - ^-- Disable fully synchronous writes (set 'writeSynchronizationMode' to PRIMARY_SYNC or FULL_ASYNC)
[14:03:57] Refer to this page for more performance suggestions: https://apacheignite.readme.io/docs/jvm-and-system-tuning
2017-10-27 14:03:57.464 [main] INFO o.a.ignite.internal.IgniteKernal - Refer to this page for more performance suggestions: https://apacheignite.readme.io/docs/jvm-and-system-tuning
[14:03:57]
2017-10-27 14:03:57.465 [main] INFO o.a.ignite.internal.IgniteKernal -
[14:03:57] To start Console Management & Monitoring run ignitevisorcmd.{sh|bat}
2017-10-27 14:03:57.465 [main] INFO o.a.ignite.internal.IgniteKernal - To start Console Management & Monitoring run ignitevisorcmd.{sh|bat}
[14:03:57]
[14:03:57] Ignite node started OK (id=64941c34)
2017-10-27 14:03:57.465 [main] INFO o.a.ignite.internal.IgniteKernal -
2017-10-27 14:03:57.466 [main] INFO o.a.ignite.internal.IgniteKernal -
>>> +----------------------------------------------------------------------+
>>> Ignite ver. 2.1.0#20170720-sha1:a6ca5c8a97e9a4c9d73d40ce76d1504c14ba1940
>>> +----------------------------------------------------------------------+
>>> OS name: Linux 3.10.0-327.el7.x86_64 amd64
>>> CPU(s): 6
>>> Heap: 10.0GB
>>> VM name: 31516@4212-bio-lanta-01
>>> Local node [ID=64941C34-2912-4934-8C92-C0C089391207, order=13, clientMode=false]
>>> Local node addresses: [4212-bio-lanta-01/0:0:0:0:0:0:0:1%lo, /10.200.43.34, /127.0.0.1]
>>> Local ports: TCP:11211 TCP:47100 TCP:47500
[14:03:57] Topology snapshot [ver=13, servers=2, clients=0, CPUs=12, heap=22.0GB]
2017-10-27 14:03:57.468 [main] INFO o.a.i.i.m.d.GridDiscoveryManager - Topology snapshot [ver=13, servers=2, clients=0, CPUs=12, heap=22.0GB]
2017-10-27 14:03:58.944 [exchange-worker-#29%null%] INFO o.a.ignite.internal.exchange.time - Started exchange init [topVer=AffinityTopologyVersion [topVer=13, minorTopVer=1], crd=false, evt=18, node=TcpDiscoveryNode [id=64941c34-2912-4934-8c92-c0c089391207, addrs=[0:0:0:0:0:0:0:1%lo, 10.200.43.34, 127.0.0.1], sockAddrs=[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, 4212-bio-lanta-01/10.200.43.34:47500], discPort=47500, order=13, intOrder=8, lastExchangeTime=1509102156183, loc=true, ver=2.1.0#20170720-sha1:a6ca5c8a, isClient=false], evtNode=TcpDiscoveryNode [id=64941c34-2912-4934-8c92-c0c089391207, addrs=[0:0:0:0:0:0:0:1%lo, 10.200.43.34, 127.0.0.1], sockAddrs=[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, 4212-bio-lanta-01/10.200.43.34:47500], discPort=47500, order=13, intOrder=8, lastExchangeTime=1509102156183, loc=true, ver=2.1.0#20170720-sha1:a6ca5c8a, isClient=false], customEvt=CacheAffinityChangeMessage [id=729ac7d5f51-0fc83046-7af0-474f-8487-f526629190a8, topVer=AffinityTopologyVersion [topVer=13, minorTopVer=0], exchId=null, partsMsg=null, exchangeNeeded=true]]
2017-10-27 14:03:58.952 [exchange-worker-#29%null%] INFO o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture - Finished waiting for partition release future [topVer=AffinityTopologyVersion [topVer=13, minorTopVer=1], waitTime=0ms]
2017-10-27 14:03:58.975 [exchange-worker-#29%null%] INFO o.a.ignite.internal.exchange.time - Finished exchange init [topVer=AffinityTopologyVersion [topVer=13, minorTopVer=1], crd=false]
2017-10-27 14:03:58.991 [sys-#41%null%] INFO o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture - Snapshot initialization completed [topVer=AffinityTopologyVersion [topVer=13, minorTopVer=1], time=0ms]
2017-10-27 14:03:58.994 [exchange-worker-#29%null%] INFO o.a.i.i.p.c.GridCachePartitionExchangeManager - Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=13, minorTopVer=1], evt=DISCOVERY_CUSTOM_EVT, node=110e4e93-2954-4b76-8649-a23f4a83990a]
2017-10-27 14:04:01.144 [main] INFO com.voice.Application - Run config context -> ok
2017-10-27 14:04:01.145 [main] INFO com.voice.Application - StartUp date: Fri Oct 27 14:01:48 MSK 2017
UPD2: Logs from when the node failed:
2017-10-28 07:48:56.320 [grid-nio-worker-tcp-comm-2-#19%null%] INFO o.a.i.s.c.tcp.TcpCommunicationSpi - Accepted incoming communication connection [locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:52329]
2017-10-28 07:48:56.323 [grid-nio-worker-tcp-comm-3-#20%null%] INFO o.a.i.s.c.tcp.TcpCommunicationSpi - Accepted incoming communication connection [locAddr=/0:0:0:0:0:0:0:1:47100, rmtAddr=/0:0:0:0:0:0:0:1:55692]
2017-10-28 07:48:56.323 [grid-nio-worker-tcp-comm-0-#17%null%] INFO o.a.i.s.c.tcp.TcpCommunicationSpi - Accepted incoming communication connection [locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:52335]
2017-10-28 07:48:59.535 [RTP-948] INFO com.voice.rtp.RtpReceiverImpl - Start callId 887B4B84-C47F-461E-35B0-97D0D7F86A51-7092 RTP processing with local port 36496
2017-10-28 07:49:00.116 [grid-nio-worker-tcp-comm-1-#18%null%] INFO o.a.i.s.c.tcp.TcpCommunicationSpi - Accepted incoming communication connection [locAddr=/0:0:0:0:0:0:0:1:47100, rmtAddr=/0:0:0:0:0:0:0:1:55698]
2017-10-28 07:49:00.117 [grid-nio-worker-tcp-comm-2-#19%null%] INFO o.a.i.s.c.tcp.TcpCommunicationSpi - Accepted incoming communication connection [locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:52341]
2017-10-28 07:49:00.372 [tcp-disco-msg-worker-#2%null%] WARN o.a.i.s.d.tcp.TcpDiscoverySpi - Node is out of topology (probably, due to short-time network problems).
2017-10-28 07:49:00.373 [disco-event-worker-#28%null%] WARN o.a.i.i.m.d.GridDiscoveryManager - Local node SEGMENTED: TcpDiscoveryNode [id=c1886afe-b28f-45b5-a91b-cfdd9e635374, addrs=[0:0:0:0:0:0:0:1%lo, 10.200.43.33, 127.0.0.1], sockAddrs=[4212-bio-lanta-03/10.200.43.33:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=15, intOrder=10, lastExchangeTime=1509102474635, loc=true, ver=2.1.0#20170720-sha1:a6ca5c8a, isClient=false]
2017-10-28 07:49:00.399 [disco-event-worker-#28%null%] WARN o.a.i.i.m.d.GridDiscoveryManager - Stopping local node according to configured segmentation policy.
2017-10-28 07:49:00.400 [disco-event-worker-#28%null%] WARN o.a.i.i.m.d.GridDiscoveryManager - Node FAILED: TcpDiscoveryNode [id=110e4e93-2954-4b76-8649-a23f4a83990a, addrs=[0:0:0:0:0:0:0:1%lo, 10.200.43.28, 127.0.0.1], sockAddrs=[/10.200.43.28:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=11, intOrder=7, lastExchangeTime=1509102530512, loc=false, ver=2.1.0#20170720-sha1:a6ca5c8a, isClient=false]
2017-10-28 07:49:00.413 [disco-event-worker-#28%null%] INFO o.a.i.i.m.d.GridDiscoveryManager - Topology snapshot [ver=16, servers=3, clients=0, CPUs=18, heap=31.0GB]
2017-10-28 07:49:00.450 [Thread-52685] INFO o.a.i.i.p.r.p.t.GridTcpRestProtocol - Command protocol successfully stopped: TCP binary
2017-10-28 07:49:00.452 [disco-event-worker-#28%null%] INFO o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture - Snapshot initialization completed [topVer=AffinityTopologyVersion [topVer=16, minorTopVer=0], time=0ms]
2017-10-28 07:49:00.470 [disco-event-worker-#28%null%] WARN o.a.i.i.m.d.GridDiscoveryManager - Node FAILED: TcpDiscoveryNode [id=64941c34-2912-4934-8c92-c0c089391207, addrs=[0:0:0:0:0:0:0:1%lo, 10.200.43.34, 127.0.0.1], sockAddrs=[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /10.200.43.34:47500], discPort=47500, order=13, intOrder=8, lastExchangeTime=1509102540526, loc=false, ver=2.1.0#20170720-sha1:a6ca5c8a, isClient=false]
2017-10-28 07:49:00.479 [disco-event-worker-#28%null%] INFO o.a.i.i.m.d.GridDiscoveryManager - Topology snapshot [ver=17, servers=2, clients=0, CPUs=12, heap=21.0GB]
2017-10-28 07:49:00.480 [disco-event-worker-#28%null%] WARN o.a.i.i.m.d.GridDiscoveryManager - Node FAILED: TcpDiscoveryNode [id=f6678eb1-f2f6-4da1-8d4d-a259c435b5a6, addrs=[0:0:0:0:0:0:0:1%lo, 10.200.43.32, 127.0.0.1], sockAddrs=[/10.200.43.32:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=14, intOrder=9, lastExchangeTime=1509102550535, loc=false, ver=2.1.0#20170720-sha1:a6ca5c8a, isClient=false]
2017-10-28 07:49:00.481 [disco-event-worker-#28%null%] INFO o.a.i.i.m.d.GridDiscoveryManager - Topology snapshot [ver=18, servers=1, clients=0, CPUs=6, heap=10.0GB]
2017-10-28 07:49:01.125 [RTP-948] ERROR com.voice.rtp.RtpReceiverImpl - During RTP socket communications
javax.cache.CacheException: class org.apache.ignite.internal.NodeStoppingException: Operation has been cancelled (node is stopping).
at org.apache.ignite.internal.processors.cache.GridCacheUtils.convertToCacheException(GridCacheUtils.java:1312)
at org.apache.ignite.internal.processors.cache.IgniteCacheProxy.cacheException(IgniteCacheProxy.java:2630)
at org.apache.ignite.internal.processors.cache.IgniteCacheProxy.get(IgniteCacheProxy.java:1188)
Logs in DEBUG mode are here: https://drive.google.com/drive/folders/0B00-0mjd5sy1MjNnSkdyR1M4VFk?usp=sharing
Upvotes: 2
Views: 2516
Reputation: 3606
While using the C# client on .NET Core I experienced the same problem: a 10-second delay before the app starts up.
After investigating, this clue in the logs was helpful:
Successfully bound communication NIO server to TCP port [port=47100, locHost=0.0.0.0/0.0.0.0, selectorsCnt=4, selectorSpins=0, pairedConn=false]
I found out that this meant I needed to set IgniteConfiguration.Localhost to the local loopback address, 127.0.0.1. I don't know why the default value of 0.0.0.0 causes this.
After making the change, the logs confirm this value is used:
Successfully bound communication NIO server to TCP port [port=47100, locHost=/127.0.0.1, selectorsCnt=4, selectorSpins=0, pairedConn=false]
and Ignite now starts up in seconds. Hope this helps someone else.
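For Java users, the equivalent of the .NET IgniteConfiguration.Localhost property is IgniteConfiguration.setLocalHost. A minimal sketch (note that binding to 127.0.0.1 only makes sense when all nodes run on the same machine):

IgniteConfiguration cfg = new IgniteConfiguration();
// Bind discovery and communication sockets to loopback instead of 0.0.0.0.
cfg.setLocalHost("127.0.0.1");
Ignite ignite = Ignition.start(cfg);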
Upvotes: 3
Reputation: 1397
The problem was the absence of the cluster hostnames in DNS. After correcting this, there are no more ten-second-long reads, and the cluster has been working correctly for at least half a day.
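For anyone hitting the same symptom: a quick way to check this from the JVM is to verify that every cluster address resolves both forward and reverse. A rough sketch with hypothetical addresses:

import java.net.InetAddress;

public class DnsCheck {
    public static void main(String[] args) throws Exception {
        // Addresses handed to the Ignite IP finder (hypothetical values).
        String[] addrs = {"10.200.43.28", "10.200.43.34"};
        for (String a : addrs) {
            InetAddress ip = InetAddress.getByName(a);
            // getCanonicalHostName() falls back to the literal IP if reverse lookup fails.
            System.out.println(a + " -> " + ip.getCanonicalHostName());
        }
    }
}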
Upvotes: 2