Reputation: 141
We are running a Node server on GAE, and for some reason the server goes offline a few times a day (sometimes it takes a few minutes to come back online).
Request volume is steady throughout the day, and no exception is logged that would explain a restart. There is no spike in requests and no unusual request that could cause it.
Log when it happens:
2020-04-18T23:48:51.881806Z GET /v1/util/example 304 35.262 ms - -
2020-04-18T23:50:17.119906Z [start] 2020/04/18 23:50:17.119185 Quitting on terminated signal
2020-04-18T23:50:17.175632Z [start] 2020/04/18 23:50:17.175267 Start program failed: user application failed with exit code -1 (refer to stdout/stderr logs for more detail): signal: terminated
2020-04-18T23:51:38.772388Z GET 304 173 B 3.3 s Example-V2/3.1.13 (com.example.app; build:1; iOS 13.4.0) Alamofire/5.1.0 /v1/util/example 5e9b928a00ff0bc9244f94194c0001737e737065616b2d76322d32613166310001737065616b2d6170693a323032303034303374303630343431000100
2020-04-18T23:51:38.786760Z GET 404 324 B 2.4 s Unknown /_ah/start 5e9b928a00ff0c014898f5c27f0001737e737065616b2d76322d32613166310001737065616b2d6170693a323032303034303374303630343431000100
2020-04-18T23:51:39.529080Z [start] 2020/04/18 23:51:39.511828 No entrypoint specified, using default entrypoint: /serve
2020-04-18T23:51:39.529642Z [start] 2020/04/18 23:51:39.528742 Starting app
2020-04-18T23:51:39.529968Z [start] 2020/04/18 23:51:39.529100 Executing: /bin/sh -c exec /serve
2020-04-18T23:51:39.590085Z [start] 2020/04/18 23:51:39.589751 Waiting for network connection open. Subject:"app/invalid" Address:127.0.0.1:8080
2020-04-18T23:51:39.590571Z [start] 2020/04/18 23:51:39.590347 Waiting for network connection open. Subject:"app/valid" Address:127.0.0.1:8081
2020-04-18T23:51:39.764383Z [serve] 2020/04/18 23:51:39.763656 Serve started.
2020-04-18T23:51:39.764935Z [serve] 2020/04/18 23:51:39.764544 Args: {runtimeName:nodejs10 memoryMB:1024 positional:[]}
2020-04-18T23:51:39.766562Z [serve] 2020/04/18 23:51:39.765904 Running /bin/sh -c exec node server.js
2020-04-18T23:51:41.072621Z [start] 2020/04/18 23:51:41.071895 Wait successful. Subject:"app/valid" Address:127.0.0.1:8081 Attempts:296 Elapsed:1.481194491s
2020-04-18T23:51:41.072978Z Express server started on port: 8081
2020-04-18T23:51:41.073008Z [start] 2020/04/18 23:51:41.072411 Starting nginx
2020-04-18T23:51:41.085901Z [start] 2020/04/18 23:51:41.085451 Waiting for network connection open. Subject:"nginx" Address:127.0.0.1:8080
2020-04-18T23:51:41.132064Z [start] 2020/04/18 23:51:41.131572 Wait successful. Subject:"nginx" Address:127.0.0.1:8080 Attempts:9 Elapsed:45.911234ms
2020-04-18T23:51:41.170786Z GET /_ah/start 404 11.865 ms - 61
More than 70% of memory is always free, so that should not be the issue. The only thing I noticed is very high CPU utilization when a restart occurs (10x higher than normal).
In the bottom picture you can clearly see when the restarts happen:
This is my app.yaml:
runtime: nodejs10
instance_class: B4
service: example-api
basic_scaling:
  max_instances: 1
  idle_timeout: 30m
handlers:
- url: .*
  secure: always
  script: auto
This is happening on our production server, so any help would be more than welcome.
Thanks!
Upvotes: 1
Views: 2160
Reputation: 488
According to this document, although App Engine tries to keep basic and manual scaling instances running indefinitely, they are sometimes restarted for maintenance, or they may fail for other reasons. That is why setting max instances to 1 is not considered best practice: a single instance is exposed to all of these failures. As mentioned in the other answer, I would also recommend increasing the number of instances, so that the likelihood of all of them failing or being restarted at the same time is lower.
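As a rough sketch of that change, assuming the `app.yaml` from the question, raising the instance ceiling might look like the fragment below. The value `2` is only an illustrative assumption. Note that with basic scaling, instances are created in response to requests, so `max_instances` raises the upper limit rather than guaranteeing that two instances are running at any given moment.

```yaml
runtime: nodejs10
instance_class: B4
service: example-api

basic_scaling:
  max_instances: 2   # assumed value for illustration; the question uses 1
  idle_timeout: 30m
```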
Upvotes: 1
Reputation: 1169
We had the same problem when we migrated our Ruby on Rails app to Google App Engine Standard a year ago. After emailing back and forth with Google Cloud Support, they suggested: "increasing the minimum number of instances will help because you will have more “backup” instances."
At the time we had two instances, and since we upped it to three instances, we have had no downtime related to unexpected server restarts.
We are still not sure why our servers are sometimes deemed unhealthy and restarted by App Engine, but having more instances can help you to avoid downtime in the short run while you investigate the underlying issue.
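To help with that investigation, one option is to log a snapshot of the process state whenever the app is asked to stop. The following is a minimal sketch, not something suggested by Google Cloud Support: the function names `shutdownSnapshot` and `installShutdownLogging` are hypothetical, and it assumes the Node process itself receives `SIGTERM` before being stopped, which the "Quitting on terminated signal" log line in the question suggests but does not confirm.

```javascript
// Build a one-line JSON snapshot of uptime, memory, and CPU use.
// A single-line JSON string is easy to search for in the logs.
function shutdownSnapshot(signal) {
  const mem = process.memoryUsage();
  const cpu = process.cpuUsage(); // cumulative CPU time in microseconds
  return JSON.stringify({
    event: 'shutdown',
    signal,
    uptimeSeconds: Math.round(process.uptime()),
    rssMB: Math.round(mem.rss / 1024 / 1024),
    heapUsedMB: Math.round(mem.heapUsed / 1024 / 1024),
    cpuUserMs: Math.round(cpu.user / 1000),
    cpuSystemMs: Math.round(cpu.system / 1000),
  });
}

// Log the snapshot on SIGTERM, stop accepting new connections, then exit.
// `server` is an http.Server; `log` defaults to console.log.
function installShutdownLogging(server, log = console.log) {
  process.on('SIGTERM', () => {
    log(shutdownSnapshot('SIGTERM'));
    server.close(() => process.exit(0));
    // Fallback (assumed 5 s grace period) in case open connections
    // keep server.close() from completing.
    setTimeout(() => process.exit(0), 5000).unref();
  });
}
```

In an Express app, `app.listen(...)` returns the underlying `http.Server`, so the result of that call can be passed to `installShutdownLogging`. Comparing the logged uptime and CPU figures across several restarts may show whether they line up with the CPU spikes or happen at regular intervals.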
Upvotes: 1