DanielG

Reputation: 1237

Tibco-Ems Failover Issue

I have two Tibco EMS servers running in a fault-tolerant setup. If the active server becomes unavailable, the standby server takes over as expected.

However, every now and then the new active server logs a strange error: "reconnect failed: connection unknown for id= XY"

This only happens if there is an open connection on my client. But an open connection is exactly the case I would expect to work: the connection should switch to the new active server as well. And as I said, sometimes it works and sometimes it does not.

When I register for EMS exceptions in my client, I get the error: "Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host."

Stacktrace:
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
   at TIBCO.EMS.LinkTcp._readEx(Byte[] buffer, Int32 offset, Int32 size)
   at TIBCO.EMS.LinkTcp._ReadWireMsg()
   at TIBCO.EMS.LinkTcp.LinkReader.Work()

Right now I have run out of ideas. Maybe somebody can help me understand what exactly the problem is. Thanks in advance.

UPDATE (late): Even though I get the "reconnect failed" error, failover works as expected. The second server takes over.

Upvotes: 3

Views: 9602

Answers (3)

Aymeric Duché

Reputation: 21

We had the same issue. Our mistake was that the store (the EMS database) wasn't shared between the active and the standby node, so when the active EMS server failed, the new active server wasn't able to recover connections and messages.

Upvotes: 2

nochum

Reputation: 795

Here's what's going on... An EMS server keeps track of the active client connections that it has, and keeps information about these connections in the meta.db store file. Upon fault-tolerant failover the new primary EMS instance is able to recover the client connections when the clients reconnect by matching information that the client provides with information stored in the meta.db store file.

There is a point in time when EMS cleans up client connections that have not reconnected. That time is governed by the ft_reconnect_timeout parameter in the tibemsd.conf configuration file. The default setting for this parameter is 60 seconds. Depending on your logging settings, when EMS cleans up "expired" connections you may see a message in your EMS logs indicating that it has "purged" a client connection.
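As a rough illustration of where that setting lives, here is a minimal tibemsd.conf fragment. The value 120 is only an example I picked to show a timeout longer than the 60-second default; it is not a recommendation, and the right value depends on how long your clients can realistically take to reconnect.

```
# tibemsd.conf (fragment)
# Seconds the newly active server waits for clients to reconnect after
# failover before it purges their connection state. Default is 60.
# 120 is an illustrative value only.
ft_reconnect_timeout = 120
```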

There are times when the client attempts to reconnect only after the EMS server has already purged the "expired" connection. This can happen when a network partition prevents the client from reconnecting to the EMS server until after the server has cleaned up the connection. When this happens, you will see the "Reconnect failed: connection unknown..." message.

When a client is unable to "re-connect" due to this error, it simply attempts a connection as a "new" connection. This works and it is able to continue processing.
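The sequence described above can be sketched as a small toy model. This is not EMS code and not how the server is implemented internally; the class and method names (`FailoverServer`, `reconnect`) and the log strings are made up for illustration, and time is passed in as plain numbers of seconds.

```python
class FailoverServer:
    """Toy model of the newly active server after failover (hypothetical, not the EMS implementation)."""

    def __init__(self, recovered_ids, takeover_time, reconnect_timeout=60.0):
        # Connection ids recovered from the shared store, still waiting for their clients.
        self.pending = set(recovered_ids)
        # After this instant, connections that have not reconnected are purged.
        self.deadline = takeover_time + reconnect_timeout
        self.log = []
        self.next_id = max(recovered_ids, default=0) + 1

    def _purge_expired(self, now):
        # Drop every recovered connection whose client did not come back in time.
        if now > self.deadline and self.pending:
            for cid in sorted(self.pending):
                self.log.append(f"purged connection id={cid}")
            self.pending.clear()

    def reconnect(self, conn_id, now):
        """Client asks to resume conn_id at time `now` (seconds)."""
        self._purge_expired(now)
        if conn_id in self.pending:
            self.pending.remove(conn_id)
            return ("resumed", conn_id)
        # The state is gone: log the error, then accept the client as a brand-new connection.
        self.log.append(f"reconnect failed: connection unknown for id={conn_id}")
        new_id = self.next_id
        self.next_id += 1
        return ("new", new_id)


server = FailoverServer([7, 8], takeover_time=0.0)
on_time = server.reconnect(7, now=10.0)  # inside the 60 s window: resumed
too_late = server.reconnect(8, now=75.0)  # after the window: purged, then treated as new
```

In this model the late client still ends up connected, just under a new id, which matches the observation that processing continues despite the error message.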

Upvotes: 5

aadi

Reputation: 128

This happens when you are using client-side FT but not server-level FT. At least, that was the underlying cause when we faced this issue.

If your clients use an FT URL such as server1:port,server2:port but the servers are not truly in FT mode, you will hit this issue whenever the connection switches between the two servers. The connection moves to a different server, but because of the incoherent FT setup the new active server can neither take over nor clean up the existing connection from the failed server.

In a true FT setup on the server side, the active server automatically assumes these connections to be active and continues to serve them. Please verify the server level configuration.

For us, providing the server level FT helped solve this issue.
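For anyone checking their own setup, a server-level FT pair looks roughly like the fragments below. The host names, port, and store path are placeholders I made up; the point is only that each server's ft_active points at the other server, and that both use the same store location on shared storage.

```
# tibemsd.conf on server1 (fragment, placeholder values)
listen    = tcp://7222
ft_active = tcp://server2:7222
store     = /shared/ems/datastore

# tibemsd.conf on server2 (fragment, placeholder values)
listen    = tcp://7222
ft_active = tcp://server1:7222
store     = /shared/ems/datastore
```

If ft_active is missing, or the two store paths point at different local directories, the servers run as two independent instances and the new active server has no record of the old connections.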

Upvotes: 0
