Reputation: 3183
I'm trying to run an app using docker swarm. The app is designed to be completely local running on a single computer using docker swarm.
If I SSH into the server and run a docker stack deploy everything works, as seen here running docker service ls
:
When this deployment works, the services generally go live in this order:
The problem I am having is on reboot. When I reboot the server, I pretty consistently have the issue of the services failing with this result:
I am getting some errors that could be helpful.
In Postgres: docker service logs APP_NAME_postgres -f
:
In Docker logs: sudo journalctl -fu docker.service
Update: June 5th, 2019
Also, By request from a GitHub issue docker version
output:
Client:
Version: 18.09.5
API version: 1.39
Go version: go1.10.8
Git commit: e8ff056
Built: Thu Apr 11 04:43:57 2019
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 18.09.5
API version: 1.39 (minimum version 1.12)
Go version: go1.10.8
Git commit: e8ff056
Built: Thu Apr 11 04:10:53 2019
OS/Arch: linux/amd64
Experimental: false
And docker info
output:
Containers: 28
Running: 9
Paused: 0
Stopped: 19
Images: 14
Server Version: 18.09.5
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active
NodeID: pbouae9n1qnezcq2y09m7yn43
Is Manager: true
ClusterID: nq9095ldyeq5ydbsqvwpgdw1z
Managers: 1
Nodes: 1
Default Address Pool: 10.0.0.0/8
SubnetSize: 24
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 10
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 1
Autolock Managers: false
Root Rotation In Progress: false
Node Address: 192.168.0.47
Manager Addresses:
192.168.0.47:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: bb71b10fd8f58240ca47fbb579b9d1028eea7c84
runc version: 2b18fe1d885ee5083ef9f0838fee39b62d653e30
init version: fec3683
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.15.0-50-generic
Operating System: Ubuntu 18.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 3.68GiB
Name: oeemaster
ID: 76LH:BH65:CFLT:FJOZ:NCZT:VJBM:2T57:UMAL:3PVC:OOXO:EBSZ:OIVH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine
WARNING: No swap limit support
And finally, My docker swarm stack/compose file:
secrets:
jwt-secret:
external: true
pg-db:
external: true
pg-host:
external: true
pg-pass:
external: true
pg-user:
external: true
ssl_dhparam:
external: true
services:
accounts:
depends_on:
- postgres
- registry
deploy:
restart_policy:
condition: on-failure
environment:
JWT_SECRET_FILE: /run/secrets/jwt-secret
PG_DB_FILE: /run/secrets/pg-db
PG_HOST_FILE: /run/secrets/pg-host
PG_PASS_FILE: /run/secrets/pg-pass
PG_USER_FILE: /run/secrets/pg-user
image: 127.0.0.1:5000/local-oee-master-accounts:v0.8.0
secrets:
- source: jwt-secret
- source: pg-db
- source: pg-host
- source: pg-pass
- source: pg-user
graphs:
depends_on:
- postgres
- registry
deploy:
restart_policy:
condition: on-failure
environment:
PG_DB_FILE: /run/secrets/pg-db
PG_HOST_FILE: /run/secrets/pg-host
PG_PASS_FILE: /run/secrets/pg-pass
PG_USER_FILE: /run/secrets/pg-user
image: 127.0.0.1:5000/local-oee-master-graphs:v0.8.0
secrets:
- source: pg-db
- source: pg-host
- source: pg-pass
- source: pg-user
health:
depends_on:
- postgres
- registry
deploy:
restart_policy:
condition: on-failure
environment:
PG_DB_FILE: /run/secrets/pg-db
PG_HOST_FILE: /run/secrets/pg-host
PG_PASS_FILE: /run/secrets/pg-pass
PG_USER_FILE: /run/secrets/pg-user
image: 127.0.0.1:5000/local-oee-master-health:v0.8.0
secrets:
- source: pg-db
- source: pg-host
- source: pg-pass
- source: pg-user
live-data:
depends_on:
- postgres
- registry
deploy:
restart_policy:
condition: on-failure
image: 127.0.0.1:5000/local-oee-master-live-data:v0.8.0
ports:
- published: 32000
target: 80
main:
depends_on:
- accounts
- graphs
- health
- live-data
- point-logs
- registry
deploy:
restart_policy:
condition: on-failure
environment:
MAIN_CONFIG_FILE: nginx.local.conf
image: 127.0.0.1:5000/local-oee-master-nginx:v0.8.0
ports:
- published: 80
target: 80
- published: 443
target: 443
modbus-logger:
depends_on:
- point-logs
- registry
deploy:
restart_policy:
condition: on-failure
environment:
CONTROLLER_ADDRESS: 192.168.2.100
SERVER_ADDRESS: http://point-logs
image: 127.0.0.1:5000/local-oee-master-modbus-logger:v0.8.0
point-logs:
depends_on:
- postgres
- registry
deploy:
restart_policy:
condition: on-failure
environment:
ENV_TYPE: local
PG_DB_FILE: /run/secrets/pg-db
PG_HOST_FILE: /run/secrets/pg-host
PG_PASS_FILE: /run/secrets/pg-pass
PG_USER_FILE: /run/secrets/pg-user
image: 127.0.0.1:5000/local-oee-master-point-logs:v0.8.0
secrets:
- source: pg-db
- source: pg-host
- source: pg-pass
- source: pg-user
postgres:
depends_on:
- registry
deploy:
restart_policy:
condition: on-failure
window: 120s
environment:
POSTGRES_PASSWORD: password
image: 127.0.0.1:5000/local-oee-master-postgres:v0.8.0
ports:
- published: 5432
target: 5432
volumes:
- /media/db_main/postgres_oee_master:/var/lib/postgresql/data:rw
registry:
deploy:
restart_policy:
condition: on-failure
image: registry:2
ports:
- mode: host
published: 5000
target: 5000
volumes:
- /mnt/registry:/var/lib/registry:rw
version: '3.2'
Things I've tried
Current Workaround
Currently, I have hacked together a solution that is working just so I can move on to the next thing. I just wrote a shell script called recreate.sh
that will kill the failed first boot version of the server, wait for it to break down, and the "manually" run docker stack deploy again. I am then setting the script to run at boot with crontab @reboot. This is working for shutdowns and reboots, but I don't accept this as the proper answer, so I won't add it as one.
Upvotes: 5
Views: 747
Reputation: 4279
It looks to me that you need to check is who/what kills postgres service. From logs you posted it seems that postrgres receives smart shutdown signal. Then, postress stops gently. Your stack file has restart policy set to "on-failure", and since postres process stops gently (exit code 0), docker does not consider this as failue and as instructed, it does not restart.
In conclusion, I'd recommend changing restart policy to "any" from "on-failure".
Also, have in mind that "depends_on" settings that you use are ignored in swarm and you need to have your services/images own way of ensuring proper startup order or to be able to work when dependent services are not up yet.
There's also thing you could try - healthchecks. Perhaps your postgres base image has a healthcheck defined and it's terminating container by sending a kill signal to it. And as wrote earlier, postgres shuts down gently and there's no error exit code and restart policy does not trigger. Try disabling healthcheck in yaml or go to dockerfiles to see for the healthcheck directive and figure out why it triggers.
Upvotes: 2