Mosloko

Reputation: 13

Hazelcast cache distribution issue on two WAS nodes

In my project I'm using Hazelcast 3.7.8 and I have an issue with data distribution between applications and nodes.

I have 2 nodes, and on each node 4 Spring applications are deployed on a WAS in a single JVM process.

Those applications share a Map between them. Each application has a hazelcast-configuration.xml file; all the files are the same, except for the network port (5701, 5702, 5703, 5704).

Often, but not always, after deploying one of those applications on both nodes at the same time, the distributed data are not the same: the freshly deployed app (on each node) has one data set, and the other apps have another.

        <cache:annotation-driven cache-manager="cacheManager" />
        <bean id="cacheManager" class="com.hazelcast.spring.cache.HazelcastCacheManager">
            <constructor-arg ref="hazelcastInstance" />
        </bean>  
        <hz:hazelcast id="hazelcastInstance">
            <hz:config>
                <hz:instance-name>myCacheInstance</hz:instance-name>
                <hz:group name="qualification" password="qualification"/>
                <hz:properties>
                    <hz:property name="hazelcast.health.monitoring.level">OFF</hz:property>
                    <hz:property name="hazelcast.health.monitoring.delay.seconds">3600</hz:property>
                </hz:properties>
                <hz:network port="5701" port-auto-increment="true">
                    <hz:join>
                        <hz:multicast enabled="false" />
                        <hz:tcp-ip enabled="true">
                            <hz:member>NODE1</hz:member>
                            <hz:member>NODE2</hz:member>
                        </hz:tcp-ip>
                    </hz:join>
                </hz:network>
                <hz:partition-group enabled="false"/>
                <hz:map name="my-map" 
                    backup-count="1"
                    async-backup-count="1"
                    time-to-live-seconds="7200"
                    max-idle-seconds="0"
                    eviction-policy="LRU"
                    max-size="15"
                    max-size-policy="USED_HEAP_PERCENTAGE"
                    eviction-percentage="25"
                    min-eviction-check-millis="100"
                    merge-policy="com.hazelcast.map.merge.PassThroughMergePolicy">
                </hz:map>
                <hz:services enable-defaults="true"/>   
            </hz:config>
        </hz:hazelcast>  
    [LOCAL] [qualification] [3.7.8] You configured your member address as host name. Please be aware of that your dns can be spoofed. Make sure that your dns configurations are correct.
    [LOCAL] [qualification] [3.7.8] Resolving domain name 'NODE1' to address(es): [192.237.154.88]
    [LOCAL] [qualification] [3.7.8] You configured your member address as host name. Please be aware of that your dns can be spoofed. Make sure that your dns configurations are correct.
    [LOCAL] [qualification] [3.7.8] Resolving domain name 'NODE2' to address(es): [192.237.155.244]
    [LOCAL] [qualification] [3.7.8] Interfaces is disabled, trying to pick one address from TCP-IP config addresses: [NODE2/192.237.155.244, NODE1/192.237.154.88]
    [LOCAL] [qualification] [3.7.8] Prefer IPv4 stack is true.
    [LOCAL] [qualification] [3.7.8] Picked [NODE2]:5705, using socket ServerSocket[addr=/0:0:0:0:0:0:0:0,localport=5705], bind any local is true
    [NODE2]:5705 [qualification] [3.7.8] Hazelcast 3.7.8 (20170525 - 4e820fa) starting at [NODE2]:5705
    [NODE2]:5705 [qualification] [3.7.8] Copyright (c) 2008-2016, Hazelcast, Inc. All Rights Reserved.
    [NODE2]:5705 [qualification] [3.7.8] Configured Hazelcast Serialization version : 1
    [NODE2]:5705 [qualification] [3.7.8] Backpressure is disabled
    [NODE2]:5705 [qualification] [3.7.8] Creating TcpIpJoiner
    [NODE2]:5705 [qualification] [3.7.8] Starting 8 partition threads
    [NODE2]:5705 [qualification] [3.7.8] Starting 5 generic threads (1 dedicated for priority tasks)
    [NODE2]:5705 [qualification] [3.7.8] [NODE2]:5705 is STARTING
    [NODE2]:5705 [qualification] [3.7.8] TcpIpConnectionManager configured with Non Blocking IO-threading model: 3 input threads and 3 output threads
    [NODE2]:5705 [qualification] [3.7.8] Connecting to NODE1/192.237.154.88:5703, timeout: 0, bind-any: true
    [NODE2]:5705 [qualification] [3.7.8] Connecting to NODE1/192.237.154.88:5704, timeout: 0, bind-any: true
    [NODE2]:5705 [qualification] [3.7.8] Connecting to NODE2/192.237.155.244:5703, timeout: 0, bind-any: true
    [NODE2]:5705 [qualification] [3.7.8] Connecting to NODE1/192.237.154.88:5705, timeout: 0, bind-any: true
    [192.237.155.244]:5703 [dev] [3.7.8] Accepting socket connection from /192.237.155.244:37105
    [NODE2]:5705 [qualification] [3.7.8] Connecting to NODE2/192.237.155.244:5704, timeout: 0, bind-any: true
    [192.237.155.244]:5703 [dev] [3.7.8] Established socket connection between /192.237.155.244:5703 and /192.237.155.244:37105
    [NODE2]:5704 [qualification] [3.7.8] Accepting socket connection from /192.237.155.244:50221
    [NODE2]:5704 [qualification] [3.7.8] Established socket connection between /192.237.155.244:5704 and /192.237.155.244:50221
    [NODE2]:5705 [qualification] [3.7.8] Established socket connection between /192.237.155.244:37105 and NODE2/192.237.155.244:5703
    [NODE2]:5705 [qualification] [3.7.8] Established socket connection between /192.237.155.244:49809 and NODE1/192.237.154.88:5704
    [NODE2]:5705 [qualification] [3.7.8] Established socket connection between /192.237.155.244:37358 and NODE1/192.237.154.88:5703
    [NODE2]:5705 [qualification] [3.7.8] Established socket connection between /192.237.155.244:50221 and NODE2/192.237.155.244:5704
    [NODE2]:5705 [qualification] [3.7.8] Established socket connection between /192.237.155.244:45740 and NODE1/192.237.154.88:5705
    [192.237.155.244]:5703 [dev] [3.7.8] Wrong bind request from [NODE2]:5705! This node is not requested endpoint: [NODE2]:5703
    [192.237.155.244]:5703 [dev] [3.7.8] Connection[id=2, /192.237.155.244:5703->/192.237.155.244:37105, endpoint=null, alive=false, type=MEMBER] closed. Reason: Wrong bind request from [NODE2]:5705! This node is not requested endpoint: [NODE2]:5703
    [NODE2]:5705 [qualification] [3.7.8] Connection[id=2, /192.237.155.244:49809->NODE1/192.237.154.88:5704, endpoint=[NODE1]:5704, alive=false, type=MEMBER] closed. Reason: Connection closed by the other side
    [NODE2]:5705 [qualification] [3.7.8] Connection[id=1, /192.237.155.244:37105->NODE2/192.237.155.244:5703, endpoint=[NODE2]:5703, alive=false, type=MEMBER] closed. Reason: Connection closed by the other side
    [NODE2]:5705 [qualification] [3.7.8] Connecting to NODE1/192.237.154.88:5704, timeout: 0, bind-any: true
    [NODE2]:5705 [qualification] [3.7.8] Connecting to NODE2/192.237.155.244:5703, timeout: 0, bind-any: true
    [192.237.155.244]:5703 [dev] [3.7.8] Accepting socket connection from /192.237.155.244:59036
    [NODE2]:5705 [qualification] [3.7.8] Established socket connection between /192.237.155.244:59036 and NODE2/192.237.155.244:5703
    [NODE2]:5705 [qualification] [3.7.8] Established socket connection between /192.237.155.244:33775 and NODE1/192.237.154.88:5704
    [192.237.155.244]:5703 [dev] [3.7.8] Established socket connection between /192.237.155.244:5703 and /192.237.155.244:59036
    [192.237.155.244]:5703 [dev] [3.7.8] Wrong bind request from [NODE2]:5705! This node is not requested endpoint: [NODE2]:5703
    [192.237.155.244]:5703 [dev] [3.7.8] Connection[id=3, /192.237.155.244:5703->/192.237.155.244:59036, endpoint=null, alive=false, type=MEMBER] closed. Reason: Wrong bind request from [NODE2]:5705! This node is not requested endpoint: [NODE2]:5703
    [NODE2]:5705 [qualification] [3.7.8] Connection[id=6, /192.237.155.244:59036->NODE2/192.237.155.244:5703, endpoint=[NODE2]:5703, alive=false, type=MEMBER] closed. Reason: Connection closed by the other side
    [NODE2]:5705 [qualification] [3.7.8] Connection[id=7, /192.237.155.244:33775->NODE1/192.237.154.88:5704, endpoint=[NODE1]:5704, alive=false, type=MEMBER] closed. Reason: Connection closed by the other side
    [NODE2]:5705 [qualification] [3.7.8] Ignoring master response [NODE1]:5703 from [NODE1]:5703 since this node has an active master [NODE2]:5704
    [NODE2]:5705 [qualification] [3.7.8] Ignoring master response [NODE1]:5703 from [NODE1]:5703 since this node has an active master [NODE2]:5704

What's wrong?

Thanks in advance

Upvotes: 1

Views: 320

Answers (1)

Neil Stevenson

Reputation: 3150

There are four areas to look at here.

Each Hazelcast instance selects an inbound port, which in the configuration shown is specified as port="5701" port-auto-increment="true".

What this means is that when the instance starts, it will try to use port 5701. If that port is in use (e.g. by another Hazelcast instance), the auto-increment flag tells it to try the next port, 5702, then 5703, and so on until one is found that is available.

(1) Based on the above, you can and probably should use the same configuration for all your Hazelcast instances. If they are correctly set up, it shouldn't cause the error described above, but if they have some unintentional differences, that could be the reason. Set them all the same and see what happens.

You could also change


                            <hz:member>NODE1</hz:member>
                            <hz:member>NODE2</hz:member>

to

                            <hz:member>NODE1:5701</hz:member>
                            <hz:member>NODE1:5702</hz:member>
                            <hz:member>NODE1:5703</hz:member>
                            <hz:member>NODE1:5704</hz:member>
                            <hz:member>NODE2:5701</hz:member>
                            <hz:member>NODE2:5702</hz:member>
                            <hz:member>NODE2:5703</hz:member>
                            <hz:member>NODE2:5704</hz:member>

(2) The logging line [qualification] [3.7.8] Creating TcpIpJoiner [NODE2]:5705 implies that ports 5701, 5702, 5703 and 5704 are already in use, which probably means four Hazelcast instances are already running on that node and this is the fifth. If you're expecting only four instances and there are five, perhaps the shutdown of one of the earlier instances hadn't completed.
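If only four instances should ever run on a host, you could also cap the auto-increment range so an unexpected fifth instance fails to start rather than silently taking port 5705. This is just a sketch, and it assumes the hazelcast-spring schema version you're on exposes the port-count attribute:

```xml
<!-- Sketch (assumption: port-count is available in your schema version):
     restrict auto-increment to the four ports 5701-5704 -->
<hz:network port="5701" port-auto-increment="true" port-count="4">
    <hz:join>
        <hz:multicast enabled="false" />
        <hz:tcp-ip enabled="true">
            <hz:member>NODE1</hz:member>
            <hz:member>NODE2</hz:member>
        </hz:tcp-ip>
    </hz:join>
</hz:network>
```

With that in place, a leftover fifth instance would fail fast at startup instead of joining the cluster on an unexpected port.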

(3) The configuration <hz:partition-group enabled="false"/> means that a data backup can be placed on any other Hazelcast instance, which might mean an instance in the same WAS process. If that WAS process fails, the data and its backup may both be lost. The HOST_AWARE setting would be safer, but you've only got two host machines and have configured a primary copy, a synchronous backup and an asynchronous backup -- three copies in total. Spreading three copies across two hosts so that each copy sits on a host with a different IP address cannot be achieved.
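As a sketch of that trade-off: with only two hosts you'd need to drop the asynchronous backup, leaving two copies in total, before HOST_AWARE grouping could place every copy on a different host. Assuming the same hazelcast-spring schema as above:

```xml
<!-- Sketch: two copies (primary + one sync backup) spread across two hosts -->
<hz:partition-group enabled="true" group-type="HOST_AWARE"/>
<hz:map name="my-map"
    backup-count="1"
    async-backup-count="0"
    time-to-live-seconds="7200"
    eviction-policy="LRU"
    max-size="15"
    max-size-policy="USED_HEAP_PERCENTAGE"
    merge-policy="com.hazelcast.map.merge.PassThroughMergePolicy">
</hz:map>
```

That way the loss of one whole host (or one WAS process) still leaves a complete copy of the map on the other host.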

(4) The logging line [qualification] [3.7.8] Starting 8 partition threads suggests it's a 4 CPU machine, which isn't going to be enough to run all that load adequately.

Also, 3.7.8 is an old version. If you're going to have to make changes to bring stability, you may as well upgrade too.

Upvotes: 0
