Reputation: 782
When a new box starts up (or presumably gets the app pool recycled), we're seeing a timeout error for every redis request. What's interesting is that it's probably 1/30 or so. That is, 30 boxes will boot up just fine and work (actual call is a Redis Lock call) to every 1 box that boots up in this faulty state. The example below shows 9k items in queue. The ConnectionMultiplexer is being initialized lazily, per MS azure recommendation (though we're not on Azure) and here's the call:
var db = m_dbFactory.GetDatabase();
bool gotLock = db.LockTake(key, value, m_redisLockConfig.RedisLockMaxAgeTimeSpan);
and we're using Ninject to get a singleton of that dbFactory injected in:
kernel.Bind<IRedisDatabaseFactory>().To<RedisDatabaseFactory>().InSingletonScope();
We've had to redeploy the code (recycling the app pool) to fix the issue, or kill the 1 bad box behind the load balancer. Has anyone come across this problem before? I see we have 9k items in queue that haven't been written to outbound network, following an azure troubleshooting link: https://azure.microsoft.com/en-us/blog/investigating-timeout-exceptions-in-stackexchange-redis-for-azure-redis-cache/
If the connection was not opened, however, I am specifically throwing an error from my redis db factory (which I'm not seeing in our logs). Here's the whole class to see the connectionmultiplexer initialization:
public class RedisDatabaseFactory : IRedisDatabaseFactory
{
private readonly Lazy<IConnectionMultiplexer> m_lazyConnectionMultiplexer;
public RedisDatabaseFactory(IRedisConfig redisConfig)
{
var endPoint = new DnsEndPoint(redisConfig.Host, redisConfig.Port);
var configOptions = new ConfigurationOptions
{
EndPoints = { endPoint },
Password = redisConfig.Password,
ConnectTimeout = 5000,
AbortOnConnectFail = false
};
m_lazyConnectionMultiplexer = new Lazy<IConnectionMultiplexer>(() =>
ConnectionMultiplexer.Connect(configOptions));
}
private IConnectionMultiplexer Connection
{
get { return m_lazyConnectionMultiplexer.Value; }
}
/// <summary>
/// Gets a connected redis database
/// </summary>
/// <exception cref="Exception"></exception>
/// <returns>Connected redis database</returns>
public IDatabase GetDatabase()
{
if (!Connection.IsConnected)
{
throw new Exception("Redis connection failure");
}
return Connection.GetDatabase();
}
}
Here's the stack trace:
System.TimeoutException: Timeout performing SET mykey, inst: 0, mgr: ExecuteSelect, err: never, queue: 9058, qu: 9058, qs: 0, qc: 0, wr: 0, wq: 0, in: 0, ar: 0, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=1,Free=32766,Min=1,Max=32767), clientName: myclient
at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor1 processor, ServerEndPoint server)
at StackExchange.Redis.RedisBase.ExecuteSync[T](Message message, ResultProcessor1 processor, ServerEndPoint server)
at StackExchange.Redis.RedisDatabase.StringSet(RedisKey key, RedisValue value, Nullable1 expiry, When when, CommandFlags flags)
at StackExchange.Redis.RedisDatabase.LockTake(RedisKey key, RedisValue value, TimeSpan expiry, CommandFlags flags)
I changed the name of my key, client name, and removed backticks.
Upvotes: 1
Views: 2126
Reputation: 782
This is really late, but we did eventually make a change that solved the problem. We upgraded to the latest StackExchange.Redis in case the issue was fixed by Marc Gravell and team, but we also made the following change:
m_lazyConnectionMultiplexer = new Lazy<IConnectionMultiplexer>(() => ConnectionMultiplexer.Connect(configOptions),LazyThreadSafetyMode.PublicationOnly;
so that should the connection multiplexer initialize to a bad state, another would get initialized afterwards. After making those 2 changes, we never saw the issue again. I believe the issue was not actually in app pool recycle but in our process of tearing down and building up the boxes from Amazon Machine Image on a regular basis. When they were built back up, occasionally 1 was in a bad state. I wish I had pinpointed the fix, but that's what worked for us.
Upvotes: 1
Reputation: 3122
Two things jump out at me from your timeout error message.
If that doesn't fix things for you, then you may want to monitor your client side CPU usage. If your client CPU is spiking up around 100%, then your system will just not have enough CPU to keep up with all the work you are trying to give it. Upgrade your client machine to something faster. The default Min Threads in the ThreadPool is 1 in your case, which often indicates that you have only 1 CPU Core, which may not be enough.
Upvotes: 0