Dave Vronay

Reputation: 635

timeout errors from Tomcat talking to MongoDB in Azure

My system is a Tomcat 7 server running on Ubuntu talking to a MongoDB cluster running on CentOS. We have this same setup on AWS and it works just fine.

I recently brought up the exact same setup on Azure, and we are having constant, seemingly random timeouts when the Tomcat app tries to query MongoDB. A typical error is:

Jan 31 08:13:54 catalina.out:  Jan 31, 2014 4:14:09 PM com.mongodb.DBPortPool gotError
Jan 31 08:13:54 catalina.out:  WARNING: emptying DBPortPool to xxx.cloudapp.net/xxx.xxx.xxx.xxx:21191 b/c of error
Jan 31 08:13:54 catalina.out:  java.net.SocketException: Connection timed out
Jan 31 08:13:54 catalina.out:   at java.net.SocketInputStream.socketRead0(Native Method)
Jan 31 08:13:54 catalina.out:   at java.net.SocketInputStream.read(SocketInputStream.java:146)
Jan 31 08:13:54 catalina.out:   at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
Jan 31 08:13:54 catalina.out:   at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
Jan 31 08:13:54 catalina.out:   at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
Jan 31 08:13:54 catalina.out:   at org.bson.io.Bits.readFully(Bits.java:46)
Jan 31 08:13:54 catalina.out:   at org.bson.io.Bits.readFully(Bits.java:33)
Jan 31 08:13:54 catalina.out:   at org.bson.io.Bits.readFully(Bits.java:28)
Jan 31 08:13:54 catalina.out:   at com.mongodb.Response.<init>(Response.java:40)
Jan 31 08:13:54 catalina.out:   at com.mongodb.DBPort.go(DBPort.java:142)
Jan 31 08:13:54 catalina.out:   at com.mongodb.DBPort.call(DBPort.java:92)
Jan 31 08:13:54 catalina.out:   at com.mongodb.DBTCPConnector.innerCall(DBTCPConnector.java:244)
Jan 31 08:13:54 catalina.out:   at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:216)
Jan 31 08:13:54 catalina.out:   at com.mongodb.DBApiLayer$MyCollection.__find(DBApiLayer.java:288)
Jan 31 08:13:54 catalina.out:   at com.mongodb.DB.command(DB.java:262)
Jan 31 08:13:54 catalina.out:   at com.mongodb.DB.command(DB.java:244)
Jan 31 08:13:54 catalina.out:   at com.mongodb.DBCollection.getCount(DBCollection.java:985)
Jan 31 08:13:54 catalina.out:   at com.mongodb.DBCollection.getCount(DBCollection.java:956)
Jan 31 08:13:54 catalina.out:   at com.mongodb.DBCollection.getCount(DBCollection.java:931)
Jan 31 08:13:54 catalina.out:   at com.mongodb.DBCollection.count(DBCollection.java:878)
Jan 31 08:13:54 catalina.out:   at com.eweware.service.base.store.impl.mongo.dao.BaseDAOImpl._exists(BaseDAOImpl.java:788)
Jan 31 08:13:54 catalina.out:   at com.eweware.service.base.store.impl.mongo.dao.GroupDAOImpl._exists(GroupDAOImpl.java:18)

I am using Java driver 2.11.4 and initializing it as follows:

        builder.autoConnectRetry(true)
                .connectionsPerHost(10)
                .writeConcern(WriteConcern.FSYNCED)
                .connectTimeout(30000)
                .socketKeepAlive(true);
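
Assembled fully, the initialization looks roughly like this (a sketch against the 2.11.x API; the class name `MongoSetup` is illustrative, and the host/port are the placeholders from the log above):

```java
import java.net.UnknownHostException;

import com.mongodb.MongoClient;
import com.mongodb.MongoClientOptions;
import com.mongodb.ServerAddress;
import com.mongodb.WriteConcern;

public class MongoSetup {
    // Host and port are placeholders taken from the log excerpt above.
    public static MongoClient create() throws UnknownHostException {
        MongoClientOptions options = MongoClientOptions.builder()
                .autoConnectRetry(true)
                .connectionsPerHost(10)
                .writeConcern(WriteConcern.FSYNCED)
                .connectTimeout(30000)
                .socketKeepAlive(true)
                .build();
        return new MongoClient(new ServerAddress("xxx.cloudapp.net", 21191), options);
    }
}
```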

Reading around the web, I saw some material suggesting that there is a known Azure issue, along with some C# workarounds, but I have not seen anything on how to correct it from Java.

Some more details:

  1. This happens whether MongoDB is a single node or a replica set.
  2. This happens whether or not the server is under load. In fact, the server seems to perform better under some load than from a cold start, but even under constant load it will time out.
  3. The timeouts appear to be random: there is no discernible pattern in which call will fail or when. Sometimes it will go an hour with no issues; other times every call will fail.
  4. If I retry the calls on a timeout error, it eventually works. Sometimes it takes >100 retries, other times just a single one.

Here is the retry code I am trying:

private DBObject findOneRetry(DBObject criteria, DBObject fields, DBCollection collection) throws SystemErrorException {
    for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
        try {
            return collection.findOne(criteria, fields);  // getting SocketException inside here
        } catch (Exception e) {
            if (attempt == MAX_RETRIES) {  // was (attempt > MAX_RETRIES), which the loop condition made unreachable
                throw new SystemErrorException(makeErrorMessage("findOneRetry", "find", attempt, e, null), e, ErrorCodes.SERVER_DB_ERROR);
            }
            logger.warning(getClass().getName() + ": findOneRetry attempt #" + attempt + " failed in collection " + _getCollection() + "; retrying");
        }
    }
    return null;  // unreachable when MAX_RETRIES >= 1: the last attempt either returns or throws
}
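
One note on the retry loop above: hammering the server with back-to-back retries can prolong a transient network problem. A generic retry helper with exponential backoff is sketched below (`RetryDemo` and `withRetry` are illustrative names, not part of the MongoDB driver; the flaky operation is simulated):

```java
import java.util.concurrent.Callable;

public class RetryDemo {
    // Retry op up to maxRetries times, sleeping base, 2x, 4x, ... ms
    // between attempts (capped at 5s). Rethrows the last failure.
    static <T> T withRetry(Callable<T> op, int maxRetries, long baseDelayMs) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                if (attempt == maxRetries) {
                    break;
                }
                // Exponential backoff, capped so a long retry run stays sane.
                long delay = Math.min(baseDelayMs << Math.min(attempt - 1, 12), 5000L);
                Thread.sleep(delay);
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated flaky call: fails twice, then succeeds.
        String result = withRetry(() -> {
            if (++calls[0] < 3) throw new RuntimeException("Connection timed out");
            return "ok";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");  // prints "ok after 3 attempts"
    }
}
```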

Any suggestions on how to correct this?

Thanks in advance!

Upvotes: 4

Views: 1297

Answers (1)

Reza

Reputation: 684

This is due to the high default tcp_keepalive_time (7200 seconds, i.e. two hours) on your CentOS server. Azure's load balancer silently drops TCP connections that sit idle for more than a few minutes, and with a two-hour keepalive interval the kernel never probes the connection in time, so the driver only discovers the dead socket when a read finally times out.

On your MongoDB server, edit the keepalive value:

    sudo nano /proc/sys/net/ipv4/tcp_keepalive_time

Change 7200 to 60.

Restart your Azure instance. (Note that a value written directly to /proc does not survive a reboot; see the updates below for making it permanent.)

Update: In order to make sure your VM always has the value of tcp_keepalive_time:

Add this line:

    bash -c 'echo 60 > /proc/sys/net/ipv4/tcp_keepalive_time'

to:

    /etc/rc.d/rc.local

Update to the update: For most flavors of Linux there is an /etc/sysctl.d/ directory. Create a file, e.g. mongo.conf, containing:

    net.ipv4.tcp_keepalive_time = 60

Put the file in that directory and run:

    sysctl -p /etc/sysctl.d/mongo.conf

Verify the change with:

    sysctl net.ipv4.tcp_keepalive_time
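
Note that tcp_keepalive_time only controls how long a connection must sit idle before the first probe is sent; two companion knobs control how quickly a dead peer is then detected. A quick way to inspect all three (a sketch; the values printed are whatever your kernel currently has set):

```shell
# Print the three kernel keepalive settings:
#   tcp_keepalive_time   - idle seconds before the first probe
#   tcp_keepalive_intvl  - seconds between probes
#   tcp_keepalive_probes - unanswered probes before the connection is dropped
sysctl net.ipv4.tcp_keepalive_time \
       net.ipv4.tcp_keepalive_intvl \
       net.ipv4.tcp_keepalive_probes
```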

This will survive reboots on the RedHat-based and Debian-based systems I've come across. You'll want to be sure to check /etc/sysctl.conf and any other files in /etc/sysctl.d to see whether the variable is already being set to something else, and make the appropriate changes.
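
On the driver side, a complementary mitigation (a sketch, assuming the 2.11.x MongoClientOptions API the question is using) is to set an explicit socketTimeout, so that a read hung on a silently dropped connection fails quickly instead of waiting for the OS-level TCP timeout:

```java
MongoClientOptions options = MongoClientOptions.builder()
        .connectTimeout(30000)
        .socketTimeout(60000)   // give up on a hung read after 60s; keep this above your slowest expected query
        .socketKeepAlive(true)  // still needed so the kernel sends keepalive probes at all
        .build();
```

The socketTimeout value must comfortably exceed your longest-running query, or the driver will start aborting legitimate operations.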

Upvotes: 5
