Reputation: 882
It most have been a version specific bug. Updating to [email protected] solved the issue and the machines connect fine. I'll leave the question for eventualy someone who faces the same issue.
I have two servers in a two member replica set.
mx.aireclaim.com is the PRIMARY and vimax.aireclaim.com is the SECONDARY. After a reboot on mx.aireclaim.com I'm faced with a connection issue on the secondary side. I've checked the firewall and can access the 27017 port from each to machine to the other.
I keep getting this error in my secondary:
Error in heartbeat (...) response status: NetworkInterfaceExceededTimeLimit:
Couldn't get a connection within the time limit
Is there something foul I'm missing? I've tried syncing the date on both machines, disabling firewalls, multiple restarts, and config changes, to no avail.
Here are the replicaset config and rs.status():
aireclaimRs:PRIMARY> rs.config()
{
"_id" : "aireclaimRs",
"version" : 3,
"protocolVersion" : NumberLong(1),
"members" : [
{
"_id" : 0,
"host" : "vimax.aireclaim.com:27017",
"arbiterOnly" : false,
"buildIndexes" : true,
"hidden" : false,
"priority" : 2,
"tags" : {
},
"slaveDelay" : NumberLong(0),
"votes" : 1
},
{
"_id" : 1,
"host" : "mx.aireclaim.com:27017",
"arbiterOnly" : false,
"buildIndexes" : true,
"hidden" : false,
"priority" : 1,
"tags" : {
},
"slaveDelay" : NumberLong(0),
"votes" : 1
}
],
"settings" : {
"chainingAllowed" : true,
"heartbeatIntervalMillis" : 2000,
"heartbeatTimeoutSecs" : 10,
"electionTimeoutMillis" : 10000,
"catchUpTimeoutMillis" : -1,
"catchUpTakeoverDelayMillis" : 30000,
"getLastErrorModes" : {
},
"getLastErrorDefaults" : {
"w" : 1,
"wtimeout" : 0
},
"replicaSetId" : ObjectId("5be0f8105dfcbe069f5c8533")
}
}
aireclaimRs:PRIMARY> rs.status()
{
"set" : "aireclaimRs",
"date" : ISODate("2020-03-23T20:08:25.719Z"),
"myState" : 1,
"term" : NumberLong(20),
"heartbeatIntervalMillis" : NumberLong(2000),
"optimes" : {
"lastCommittedOpTime" : {
"ts" : Timestamp(0, 0),
"t" : NumberLong(-1)
},
"appliedOpTime" : {
"ts" : Timestamp(1584994100, 1),
"t" : NumberLong(20)
},
"durableOpTime" : {
"ts" : Timestamp(1584994100, 1),
"t" : NumberLong(20)
}
},
"members" : [
{
"_id" : 0,
"name" : "vimax.aireclaim.com:27017",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 1595,
"optime" : {
"ts" : Timestamp(1584838063, 1),
"t" : NumberLong(11)
},
"optimeDurable" : {
"ts" : Timestamp(1584838063, 1),
"t" : NumberLong(11)
},
"optimeDate" : ISODate("2020-03-22T00:47:43Z"),
"optimeDurableDate" : ISODate("2020-03-22T00:47:43Z"),
"lastHeartbeat" : ISODate("2020-03-23T20:08:24.148Z"),
"lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
"pingMs" : NumberLong(0),
"configVersion" : 3
},
{
"_id" : 1,
"name" : "mx.aireclaim.com:27017",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 8376,
"optime" : {
"ts" : Timestamp(1584994100, 1),
"t" : NumberLong(20)
},
"optimeDate" : ISODate("2020-03-23T20:08:20Z"),
"electionTime" : Timestamp(1584992519, 1),
"electionDate" : ISODate("2020-03-23T19:41:59Z"),
"configVersion" : 3,
"self" : true
}
],
"ok" : 1,
"operationTime" : Timestamp(1584994100, 1),
"$clusterTime" : {
"clusterTime" : Timestamp(1584994100, 1),
"signature" : {
"hash" : BinData(0,"oGCz38lgYjsGTrWc3maAD2vyc6M="),
"keyId" : NumberLong("6765287379189104641")
}
}
}
aireclaimRs:SECONDARY> rs.status()
{
"set" : "aireclaimRs",
"date" : ISODate("2020-03-23T19:56:37.132Z"),
"myState" : 2,
"term" : NumberLong(20),
"syncingTo" : "",
"syncSourceHost" : "",
"syncSourceId" : -1,
"heartbeatIntervalMillis" : NumberLong(2000),
"optimes" : {
"lastCommittedOpTime" : {
"ts" : Timestamp(0, 0),
"t" : NumberLong(-1)
},
"appliedOpTime" : {
"ts" : Timestamp(1584838063, 1),
"t" : NumberLong(11)
},
"durableOpTime" : {
"ts" : Timestamp(1584838063, 1),
"t" : NumberLong(11)
}
},
"members" : [
{
"_id" : 0,
"name" : "vimax.aireclaim.com:27017",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 890,
"optime" : {
"ts" : Timestamp(1584838063, 1),
"t" : NumberLong(11)
},
"optimeDate" : ISODate("2020-03-22T00:47:43Z"),
"syncingTo" : "",
"syncSourceHost" : "",
"syncSourceId" : -1,
"infoMessage" : "",
"configVersion" : 3,
"self" : true,
"lastHeartbeatMessage" : ""
},
{
"_id" : 1,
"name" : "mx.aireclaim.com:27017",
"health" : 0,
"state" : 8,
"stateStr" : "(not reachable/healthy)",
"uptime" : 0,
"optime" : {
"ts" : Timestamp(0, 0),
"t" : NumberLong(-1)
},
"optimeDurable" : {
"ts" : Timestamp(0, 0),
"t" : NumberLong(-1)
},
"optimeDate" : ISODate("1970-01-01T00:00:00Z"),
"optimeDurableDate" : ISODate("1970-01-01T00:00:00Z"),
"lastHeartbeat" : ISODate("2020-03-23T19:56:31.106Z"),
"lastHeartbeatRecv" : ISODate("2020-03-23T19:56:36.034Z"),
"pingMs" : NumberLong(0),
"lastHeartbeatMessage" : "Couldn't get a connection within the time limit",
"syncingTo" : "",
"syncSourceHost" : "",
"syncSourceId" : -1,
"infoMessage" : "",
"configVersion" : -1
}
],
"ok" : 1,
"operationTime" : Timestamp(1584838063, 1),
"$clusterTime" : {
"clusterTime" : Timestamp(1584993390, 1),
"signature" : {
"hash" : BinData(0,"zINnjtBKXZ14gJrxrUbT2zyurqQ="),
"keyId" : NumberLong("6765287379189104641")
}
}
}
This is what is happening in the logs:
tail /var/log/mongodb/mongodb.log
2020-03-23T21:08:09.420+0100 I NETWORK [conn2630] end connection 144.76.84.5:50114 (2 connections now open)
2020-03-23T21:08:29.420+0100 I NETWORK [listener] connection accepted from 144.76.84.5:50118 #2631 (3 connections now open)
2020-03-23T21:08:29.420+0100 I NETWORK [conn2631] end connection 144.76.84.5:50118 (2 connections now open)
2020-03-23T21:08:49.420+0100 I NETWORK [listener] connection accepted from 144.76.84.5:50122 #2632 (3 connections now open)
2020-03-23T21:08:49.420+0100 I NETWORK [conn2632] end connection 144.76.84.5:50122 (2 connections now open)
2020-03-23T21:09:09.421+0100 I NETWORK [listener] connection accepted from 144.76.84.5:50126 #2633 (3 connections now open)
2020-03-23T21:09:09.421+0100 I NETWORK [conn2633] end connection 144.76.84.5:50126 (2 connections now open)
2020-03-23T21:09:18.717+0100 I NETWORK [conn2629] end connection 127.0.0.1:60316 (1 connection now open)
2020-03-23T21:09:29.421+0100 I NETWORK [listener] connection accepted from 144.76.84.5:50130 #2634 (2 connections now open)
2020-03-23T21:09:29.421+0100 I NETWORK [conn2634] end connection 144.76.84.5:50130 (1 connection now open)
tail /var/log/mongodb/mongod.log
2020-03-23T20:48:13.408+0100 I REPL [replexec-1] Not starting an election, since we are not electable due to: Not standing for election because I cannot see a majority (mask 0x1)
2020-03-23T20:48:17.592+0100 I REPL_HB [replexec-1] Error in heartbeat (requestId: 74) to mx.aireclaim.com:27017, response status: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2020-03-23T20:48:19.658+0100 I REPL [rsBackgroundSync] waiting for 2 pings from other members before syncing
2020-03-23T20:48:24.791+0100 I REPL [replexec-1] Not starting an election, since we are not electable due to: Not standing for election because I cannot see a majority (mask 0x1)
2020-03-23T20:48:28.092+0100 I REPL_HB [replexec-1] Error in heartbeat (requestId: 76) to mx.aireclaim.com:27017, response status: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2020-03-23T20:48:29.593+0100 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to mx.aireclaim.com:27017 - NetworkInterfaceExceededTimeLimit: Operation timed out
2020-03-23T20:48:29.593+0100 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to mx.aireclaim.com:27017
2020-03-23T20:48:34.661+0100 I REPL [rsBackgroundSync] waiting for 2 pings from other members before syncing
2020-03-23T20:48:35.839+0100 I REPL [replexec-1] Not starting an election, since we are not electable due to: Not standing for election because I cannot see a majority (mask 0x1)
2020-03-23T20:48:38.593+0100 I REPL_HB [replexec-1] Error in heartbeat (requestId: 77) to mx.aireclaim.com:27017, response status: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
So it looks like the primary accepts the connection then ends it.
You would think that there is some conectivity issue between the two machines, but I can :
aireclaim@aireclaim-platform-server:~> mongo mx.aireclaim.com
MongoDB shell version v3.6.10
connecting to: mongodb://mx.aireclaim.com:27017/test?gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("de193403-69ca-4f74-a6c5-ee384eb420d7") }
MongoDB server version: 3.6.3
aireclaimRs:PRIMARY>
and:
root@mx /home/aireclaim mongo vimax.aireclaim.com
MongoDB shell version v3.6.3
connecting to: mongodb://vimax.aireclaim.com:27017/test
MongoDB server version: 3.6.10
aireclaimRs:SECONDARY>
Also I can atest that both machines have port 27017 open and that they are configured to run on those Ips.
In fact the entire thing remains funcional somehow. The clients that atempt to connect to the replicaset manage to do it, but It looks as though the secondary is no longer replicating.
Any insight would be a godsend.
Upvotes: 4
Views: 7352