Alexander Urmuzov

Reputation: 139

AppEngine sudden 503 errors

We are experiencing some strange App Engine behaviour: the application stops serving requests until it is redeployed. This shows up as the flat areas on the CPU Utilization graph. When this happens, the application still serves static content, but all servlets and Google Cloud Endpoints return a 503 error.

Recently I added endpoint monitoring in Stackdriver, which checks a very simple servlet:

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Trivial health check: just confirms the servlet container is responding.
public class HealthCheckServlet extends HttpServlet {

    @Override
    public void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        PrintWriter writer = resp.getWriter();
        writer.print("ok");
    }
}

The results confused me even more:

...
2015-01-19 18:05:04 UTC - failed 
2015-01-19 18:10:02 UTC - resolved - gave an error for 4 minutes 
2015-01-19 18:45:04 UTC - failed   - worked for 35 minutes  
2015-01-19 18:49:03 UTC - resolved - gave an error for 3 minutes 
2015-01-19 18:51:03 UTC - failed   - worked for 2 minutes 
2015-01-19 19:24:03 UTC - resolved - gave an error for 33 minutes 
2015-01-19 19:31:07 UTC - failed   - worked for 7 minutes 
2015-01-19 19:48:02 UTC - resolved - gave an error for 16 minutes 
2015-01-19 20:19:04 UTC - failed   - worked for 31 minutes
2015-01-20 03:25:03 UTC - resolved - gave an error for 425 minutes
2015-01-20 03:46:05 UTC - failed   - worked for 21 minutes
2015-01-20 06:50:04 UTC - resolved - gave an error for 183 minutes
2015-01-20 07:21:04 UTC - failed   - worked for 31 minutes
2015-01-20 07:26:02 UTC - resolved - gave an error for 4 minutes
2015-01-20 09:21:02 UTC - failed   - worked for 115 minutes
2015-01-20 09:26:02 UTC - resolved - gave an error for 5 minutes
2015-01-20 14:25:02 UTC - failed   - worked for 301 minutes    
2015-01-20 14:31:02 UTC - resolved - gave an error for 6 minutes
2015-01-20 15:09:02 UTC - failed   - worked for 28 minutes    
2015-01-20 15:14:05 UTC - resolved - gave an error for 5 minutes

It looks like it is working 50% of the time and failing the other 50%, or even more!

I don't know whether it is important or not, but our application uses Managed VMs with the standard java7 runtime and depends heavily on Cron and Task Queues.
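
For reference, the tasks are enqueued through the standard Task Queue API, roughly like this (the queue name and handler URL below are just placeholders, not our real ones):

import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

public class EnqueueExample {

    // Illustrative only: push a unit of background work onto a named queue.
    public static void enqueue(String payload) {
        Queue queue = QueueFactory.getQueue("background-work"); // placeholder queue name
        queue.add(TaskOptions.Builder
                .withUrl("/tasks/process")                      // placeholder handler URL
                .param("payload", payload));
    }
}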

Update:

Some OOM investigation.

I switched one of the stuck VMs to user-managed mode and checked all the logs in /var/log for OOM messages - found nothing. Then I checked whether any java process was running - there was none. That looks strange, because if there is no java process to respond to health checks, the VM should be restarted, yet this particular VM had been started more than 5 hours earlier.

Then I switched this VM back to Google-managed mode, GAE restarted it, and I switched it to user-managed mode again and checked the java processes - this time there was one, occupying 77% of memory and climbing. The load average kept increasing until it peaked at 90, followed by a VM reboot.

It's obvious now that I have memory issues, but I haven't found anything memory-related in the logs - neither in the GAE console nor on the server.

My current assumption is that the auto-scaling process fails after some time, maybe because of too many OOM-related VM restarts or some other unexpected behaviour from the VMs. How else to explain a VM without a java process not being restarted for several hours?

To check this hypothesis I need to find some auto-scaling logs, which I still cannot find.

It would also be great to see memory usage for the Managed VMs, but it looks like that is not available either from the GAE console or from Stackdriver. Or I can't find it.
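
One workaround I can think of is logging the JVM heap usage from inside the application itself, e.g. from the health check servlet or a cron handler, so that heap growth at least shows up in the request logs. A rough sketch (class name is just illustrative):

import java.util.logging.Logger;

public class MemoryLogger {

    private static final Logger log = Logger.getLogger(MemoryLogger.class.getName());

    // Log current JVM heap usage; call this from any frequently hit handler.
    public static void logMemory() {
        Runtime rt = Runtime.getRuntime();
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        long maxMb = rt.maxMemory() / (1024 * 1024);
        log.info("JVM heap used: " + usedMb + " MB of " + maxMb + " MB max");
    }
}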

Upvotes: 1

Views: 584

Answers (1)

Peter Knego

Reputation: 80350

Yeah, at the moment Managed VMs are beta, although sometimes they feel more like alpha. By the way, we are experiencing a similar situation.

It seems we have a problem with memory running out (and getting OOM), which makes the JVM totally unresponsive - the health check is also unavailable. The result is that the instance gets restarted by GAE, which takes about 4-5 minutes.

Since your instance serves a lot of Cron and Task Queue requests, my assumption would be that it runs long-running tasks that might eat a lot of memory.

Hint: enable the health check and make it actually perform some meaningful work - instantiate some classes, check some internal state, etc. Also set the health-check settings to sensible values, so that server errors are acted upon within about a minute. For example, the default settings listed in the docs will only start restarting an unresponsive instance after roughly 10 minutes: (5 sec + 4 sec) * 60 = 9 minutes, plus the time to restart (2-3 minutes).
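
Something along these lines, for example (an untested sketch - the 90% heap threshold and the extra work are just illustrations of the idea):

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class DeepHealthCheckServlet extends HttpServlet {

    @Override
    public void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        double heapRatio = (double) used / rt.maxMemory();

        // Do a bit of real work so the check exercises the JVM itself,
        // not just the servlet container.
        StringBuilder probe = new StringBuilder("probe");
        probe.append(System.currentTimeMillis());

        if (heapRatio > 0.9) {
            // Example threshold: report unhealthy when the heap is almost full,
            // so the monitoring (and GAE) can react before the JVM hangs.
            resp.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE,
                    "heap usage " + (int) (heapRatio * 100) + "%");
            return;
        }
        resp.getWriter().print("ok");
    }
}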

Upvotes: 2
