Reputation: 163
For the last few days our team has been struggling with an ongoing issue where, at very predictable intervals, one ColdFusion instance has been white-screen-of-death-ing.
Every three hours the site would simply start returning a blank white page for any url. We would then restart the instance and everything would be great... for another three hours, almost to the minute. Of course this happened on a Friday, so all weekend people were taking turns re-booting the instance every time it died.
As best as I can discern, no one made any changes to either ColdFusion or our server environment right before this started happening. Before this the instance was running fine.
Since then we've seen that the isapi_redirect.log file for this instance is filled with Tomcat/connection errors.
We followed the excellent instructions at http://www.webtrenches.com/post.cfm/resolve-stability-problems-and-speed-up-coldfusion-10 and adjusted our connector settings as recommended. While this may very well have helped general performance, and it changed the time between crashes from 3 to 3.5 hours, it has not resolved the problem.
Before that we even tried moving the site from one of our virtual servers to another with no luck.
We tried re-booting IIS and even re-booting the entire server one night to see if that would help, and still nothing.
Below is as much information as I can provide from what we are seeing in our logs and our configurations. Any help would be very very much appreciated and please let me know what other details I can provide that would be useful.
We are running IIS v7.5.7600.16385
This is the only website/IIS record bound to this instance and it's bound specifically to it, not "All websites".
When the problem occurs, I do not think any requests make it to the instance. The IIS logs show that connections are still happening, but the http.log files for the instance just stop.
I am not sure if the tomcat related errors are the problem or a symptom.
The server itself runs fine when the problem occurs; we have several other CF instances running alongside this one that have no issues.
The CF admin for the instance in question loads and is completely responsive during the problem (This has not often, for me, been the case for other past issues with an instance).
Again, no one changed anything with our code, CF instance configuration, or server configuration directly prior to this problem starting as far as we can tell.
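One way to narrow down the observations above is to check, at the moment the white screen appears, whether the instance's AJP port still accepts TCP connections at all. The following is only a minimal Python sketch under that assumption. The function name ajp_port_accepts is made up for illustration, and the host and port are taken from the workers.properties shown in this post. A successful TCP connect does not prove Tomcat is processing requests; it only separates "port closed/refused" from "port open but not answering".

```python
import socket


def ajp_port_accepts(host: str = "127.0.0.1", port: int = 8014,
                     timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port can be opened.

    Hypothetical helper: it does not speak AJP, it only tests whether
    something is listening and accepting connections on the port.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    print("AJP port reachable:", ajp_port_accepts())
```

Running this from the server itself during an outage and again after a restart would show whether the "Failed opening socket" entries in isapi_redirect.log correspond to a port that is genuinely refusing connections.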
Server Product: ColdFusion
Version: 10,0,13,287689
Tomcat Version: 7.0.23.0
Edition: Enterprise
Operating System: Windows Server 2008 R2
OS Version: 6.1
Update Level: chf10000013.jar
Adobe Driver Version: 4.1 (Build 0001)
workers.properties:
worker.list=Instance_Codebase
worker.Instance_Codebase.type=ajp13
worker.Instance_Codebase.host=localhost
worker.Instance_Codebase.port=8014
worker.Instance_Codebase.max_reuse_connections=250
worker.Instance_Codebase.connection_pool_size=250
worker.Instance_Codebase.connection_pool_timeout=60
server.xml
<Server port="8009" shutdown="SHUTDOWN">
<Listener className="org.apache.catalina.core.AprLifecycleListener" SSLEngine="on"></Listener>
<Listener className="org.apache.catalina.core.JasperListener"></Listener>
<Listener className="org.apache.catalina.core.JreMemoryLeakPreventionListener"></Listener>
<Listener className="org.apache.catalina.mbeans.GlobalResourcesLifecycleListener"></Listener>
<GlobalNamingResources>
<Resource description="User database that can be updated and saved" name="UserDatabase" pathname="conf/tomcat-users.xml" factory="org.apache.catalina.users.MemoryUserDatabaseFactory" type="org.apache.catalina.UserDatabase" auth="Container"></Resource>
</GlobalNamingResources>
<Service name="Catalina">
<Executor name="tomcatThreadPool" minSpareThreads="4" maxThreads="150" namePrefix="catalina-exec-"></Executor>
<Connector port="8014" protocol="AJP/1.3" redirectPort="8447" tomcatAuthentication="false" maxThreads="250" connectionTimeout="60000"></Connector>
<Engine jvmRoute="Instance_Codebase" name="Catalina" defaultHost="localhost">
<Realm className="org.apache.catalina.realm.LockOutRealm">
<Realm className="org.apache.catalina.realm.UserDatabaseRealm" resourceName="UserDatabase"></Realm>
</Realm>
<Host name="localhost" autoDeploy="false" unpackWARs="true" appBase="webapps">
<!--<Valve pattern="%h %l %u %t "%r" %s %b" directory="logs" prefix="localhost_access_log." className="org.apache.catalina.valves.AccessLogValve" suffix=".txt" resolveHosts="false"></Valve>-->
</Host>
</Engine>
<Connector port="8501" protocol="org.apache.coyote.http11.Http11NioProtocol" connectionTimeout="20000" redirectPort="8443" executor="tomcatThreadPool"></Connector>
</Service>
</Server>
A sample of our isapi_redirect.log. A full chunk of it can be viewed at http://trasper.com/files/isapi_redirect.log.txt.
The problem (in this example) happened right about at 11:41pm as far as we can tell.
[Wed Jun 25 23:40:34.503 2014] [10012:912] [info] ajp_send_request::jk_ajp_common.c (1658): (Instance_Codebase) all endpoints are disconnected, detected by connect check (27), cping (0), send (0)
[Wed Jun 25 23:40:34.504 2014] [10012:1396] [info] ajp_connection_tcp_get_message::jk_ajp_common.c (1313): (Instance_Codebase) can't receive the response header message from tomcat, network problems or tomcat (127.0.0.1:8014) is down (errno=54)
[Wed Jun 25 23:40:34.820 2014] [10012:1396] [error] ajp_get_reply::jk_ajp_common.c (2190): (Instance_Codebase) Tomcat is down or refused connection. No response has been sent to the client (yet)
[Wed Jun 25 23:40:34.823 2014] [10012:1396] [info] ajp_service::jk_ajp_common.c (2692): (Instance_Codebase) sending request to tomcat failed (recoverable), (attempt=1)
[Wed Jun 25 23:40:34.708 2014] [10012:7880] [error] ajp_get_reply::jk_ajp_common.c (2190): (Instance_Codebase) Tomcat is down or refused connection. No response has been sent to the client (yet)
[Wed Jun 25 23:40:40.477 2014] [10012:2296] [info] ajp_connect_to_endpoint::jk_ajp_common.c (1047): Failed opening socket to (127.0.0.1:8014) (errno=61)
[Wed Jun 25 23:40:40.364 2014] [10012:8256] [error] ajp_service::jk_ajp_common.c (2711): (Instance_Codebase) connecting to tomcat failed.
[Wed Jun 25 23:40:40.825 2014] [10012:7060] [error] HttpExtensionProc::jk_isapi_plugin.c (2309): service() failed with http error 503
[Wed Jun 25 23:40:40.877 2014] [10012:10364] [error] ajp_send_request::jk_ajp_common.c (1669): (Instance_Codebase) connecting to backend failed. Tomcat is probably not started or is listening on the wrong port (errno=61)
[Wed Jun 25 23:40:40.965 2014] [10012:10364] [info] ajp_service::jk_ajp_common.c (2692): (Instance_Codebase) sending request to tomcat failed (recoverable), because of error during request sending (attempt=1)
[Wed Jun 25 23:40:40.857 2014] [10012:1020] [error] HttpExtensionProc::jk_isapi_plugin.c (2309): service() failed with http error 503
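To pin down exactly when the errors begin, it can help to count log entries per minute and per level instead of reading the file by eye. Below is a rough Python sketch, assuming every line follows the layout of the sample above (timestamp, pid:tid, level, message). The names LINE_RE and count_by_minute are illustrative only, and lines that do not match the pattern are simply skipped.

```python
import re
from collections import Counter

# Matches the prefix "[Wed Jun 25 23:40:34.503 2014] [10012:912] [info] ".
# Captures: month/day/hour:minute, year, and the log level.
LINE_RE = re.compile(
    r"^\[\w{3} (\w{3} +\d+ \d{2}:\d{2}):\d{2}\.\d+ (\d{4})\] "
    r"\[\d+:\d+\] \[(\w+)\] "
)


def count_by_minute(lines):
    """Count entries per (minute, level); unparseable lines are ignored."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            minute = f"{match.group(2)} {match.group(1)}"
            counts[(minute, match.group(3))] += 1
    return counts


# Three lines copied from the sample above.
SAMPLE = [
    "[Wed Jun 25 23:40:34.503 2014] [10012:912] [info] ajp_send_request::jk_ajp_common.c (1658): (Instance_Codebase) all endpoints are disconnected, detected by connect check (27), cping (0), send (0)",
    "[Wed Jun 25 23:40:34.820 2014] [10012:1396] [error] ajp_get_reply::jk_ajp_common.c (2190): (Instance_Codebase) Tomcat is down or refused connection. No response has been sent to the client (yet)",
    "[Wed Jun 25 23:40:40.825 2014] [10012:7060] [error] HttpExtensionProc::jk_isapi_plugin.c (2309): service() failed with http error 503",
]

if __name__ == "__main__":
    for (minute, level), n in sorted(count_by_minute(SAMPLE).items()):
        print(minute, level, n)
```

To run it against the real file, pass an open file handle instead of SAMPLE, for example count_by_minute(open("isapi_redirect.log", encoding="utf-8", errors="replace")).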
Upvotes: 4
Views: 1730
Reputation: 163
Thanks everyone for the input and assistance. As of today, we’ve been running WSOD free for 4+ days and counting.
We are still not sure what kicked off the problem. It might have just been a tipping point in web traffic, but I am pretty confident we have it under control now.
By default, when a connector is created using the Web Server Configuration Tool (wsconfig.exe), the connection pool is set to 250 connections, but no matching value is written to the server.xml configuration. We changed the AJP/1.3 connector to specify a matching maxThreads value and added a 60 second connection timeout (connectionTimeout is given in milliseconds, hence 60000), since connections otherwise never time out.
We also adjusted the workers.properties file to specify a connection_pool_size and a connection_pool_timeout (in seconds) that match those values.
The previous default settings seemed to line up with what we saw in isapi_redirect.log: every time we got to right about 200 connections, Tomcat would stall. Matching up all these settings seems to have helped.
After the configuration changes, we deleted and then recreated the connector itself from the instance. This way we are 100% sure that the connector is up to date with the latest changes from all the Server Updates.
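Since the fix came down to keeping the connector pool size and the Tomcat thread limit in step, a small script can confirm that the two files agree. This is only a sketch in Python: the helper names are invented for illustration, and the embedded strings are trimmed copies of the settings quoted in this post. It compares connection_pool_size for a worker against maxThreads on the AJP connector listening on that worker's port.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copies of the settings quoted in this post.
WORKERS_PROPERTIES = """\
worker.list=Instance_Codebase
worker.Instance_Codebase.type=ajp13
worker.Instance_Codebase.host=localhost
worker.Instance_Codebase.port=8014
worker.Instance_Codebase.max_reuse_connections=250
worker.Instance_Codebase.connection_pool_size=250
worker.Instance_Codebase.connection_pool_timeout=60
"""

SERVER_XML = """\
<Server port="8009" shutdown="SHUTDOWN">
  <Service name="Catalina">
    <Connector port="8014" protocol="AJP/1.3" redirectPort="8447"
               tomcatAuthentication="false" maxThreads="250"
               connectionTimeout="60000"></Connector>
  </Service>
</Server>
"""


def worker_settings(properties_text, worker):
    """Collect 'worker.<name>.<key>=<value>' pairs for one worker."""
    prefix = f"worker.{worker}."
    settings = {}
    for raw in properties_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key.startswith(prefix):
            settings[key[len(prefix):]] = value.strip()
    return settings


def ajp_connectors(server_xml_text):
    """Return the attribute dicts of all AJP <Connector> elements."""
    root = ET.fromstring(server_xml_text)
    return [c.attrib for c in root.iter("Connector")
            if c.get("protocol", "").startswith("AJP")]


def pool_matches_threads(properties_text, server_xml_text, worker):
    """True if connection_pool_size equals maxThreads on the worker's port."""
    settings = worker_settings(properties_text, worker)
    pool = settings.get("connection_pool_size")
    for conn in ajp_connectors(server_xml_text):
        if conn.get("port") == settings.get("port"):
            return pool is not None and pool == conn.get("maxThreads")
    return False  # no AJP connector found on the worker's port


if __name__ == "__main__":
    print(pool_matches_threads(WORKERS_PROPERTIES, SERVER_XML,
                               "Instance_Codebase"))
```

A connector element without a maxThreads attribute, like the original default, makes the check return False, which is the mismatch described above.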
We also then restarted the website in IIS, but we had to ensure that the w3wp.exe process for the instance was reset as well (we killed the process and let it restart).
Then we brought everything back up and have not had any problems since.
Thanks again for the assistance both here and on the Adobe forums; it helped us focus in on some of our issues. I’ll be sure to update this post if any other information comes to light. I’m pretty sure these steps will help anyone having connector/tomcat performance issues.
Here are some of the great resources we were able to find that helped us out a lot:
1.) server.xml
Changed
<Connector port="8014" protocol="AJP/1.3" redirectPort="8446" tomcatAuthentication="false">
to
<Connector port="8014" protocol="AJP/1.3" redirectPort="8447" tomcatAuthentication="false" maxThreads="250" connectionTimeout="60000">
2.) workers.properties
Set (to ensure it matched our # of connections)
worker.Instance_Codebase.max_reuse_connections=250
Added lines
worker.Instance_Codebase.connection_pool_size=250
worker.Instance_Codebase.connection_pool_timeout=60
3.) Deleted the existing connector, then re-created it using the Web Server Configuration Tool (wsconfig.exe) for the instance (Be sure to Run As Administrator!).
Also note that rebuilding the connector will likely require you to reenter the above changes to your workers.properties file.
4.) Restart the IIS site, which includes ensuring that the w3wp.exe process for the site is stopped/killed and restarted.
5.) Start the instance and IIS site back up.
Upvotes: 1
Reputation: 2616
I believe this is likely related to Tomcat and not ColdFusion. There are a number of posts around the Internet about Tomcat returning empty responses when it hits an error, and there was even a bug fix for this in an earlier version of Tomcat (2011). ColdFusion ships a customized Tomcat, so it's up to Adobe to bring in all upstream changes and release them as hotfixes. I'm not sure which version of Tomcat Adobe used when they started customizing it (perhaps in 2010 or 2011) or how easy it is for them to retrofit patches. There is a similar issue with application pools and Tomcat on the Adobe forums, where Tomcat has the patch but Adobe did not integrate it into their version of Tomcat. https://forums.adobe.com/thread/1023068?start=40&tstart=0
Here is an example of a bug fix on tomcat: https://issues.apache.org/bugzilla/show_bug.cgi?id=51550
I remember seeing another post about Tomcat having its default error page incorrectly set to "" (errorPage="") rather than an actual error page, which will push up an empty response.
This would also explain why you can't trap the error in ColdFusion and IIS just serves out a 200.
So, the answer in this case is a bit of a mystery. You can have your web server layer automatically retry empty responses in the hope that they will work, since they are usually fine on a page refresh, but this also has the potential to exacerbate any catastrophe. Still, it's a reasonable workaround. You could also try to find out whether Adobe has any solutions for updating Tomcat.
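To illustrate the retry idea only, here is a minimal Python sketch of a bounded retry on empty bodies. It runs at the client or proxy level rather than inside IIS, and the function names, attempt count and delays are all assumptions for illustration, not settings from any product. Keeping the number of attempts small and backing off between them is meant to limit the risk of piling extra load onto an instance that is already struggling.

```python
import time
import urllib.request


def fetch_body(url, timeout=10.0):
    """Fetch a URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_with_retry(url, fetch=fetch_body, attempts=3, delay=0.5):
    """Retry while the body is empty, up to `attempts` tries in total.

    Hypothetical helper: `fetch` is injectable so the retry logic can be
    exercised without a live server. Returns the last body received,
    which may still be empty if every attempt came back blank.
    """
    body = b""
    for attempt in range(attempts):
        body = fetch(url)
        if body.strip():
            return body
        if attempt < attempts - 1:
            # Exponential backoff: delay, 2*delay, 4*delay, ...
            time.sleep(delay * (2 ** attempt))
    return body
```

Note that exceptions from the fetch (timeouts, HTTP errors) are deliberately not caught here; the sketch only deals with the "200 with an empty page" case described above.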
Anit would have the ultimate authority here, my answer is mostly speculation.
Upvotes: 0
Reputation: 26
Try commenting out the onError() method in Application.cfc. Then your white screen of death will display an error message, which may help you debug what's going on.
Upvotes: 0
Reputation: 1228
You can ignore most of the entries in the log, as they are info-level messages from Tomcat. What concerns me are the alternating 502 (Bad Gateway) and 503 (Service Unavailable) errors. The logs only contain info/error entries and no debug information. Can you change the log level from "info" to "debug" and restart IIS?
Also, your site's connector needs tuning. You may refer to http://blogs.coldfusion.com/post.cfm/coldfusion-11-iis-connector-tuning, which applies to CF10 as well. You can enable metric logging (Debugging & Logging > Debug Output Settings) and then tune the connectors. Use the Current Thread Count as an input to the connection_pool_size and then set the max_reuse_connections.
Upvotes: 0