Alex J

Reputation: 3183

Postgres not starting on swarm server reboot

I'm trying to run an app using Docker Swarm. The app is designed to run completely locally, on a single computer, under swarm.

If I SSH into the server and run a docker stack deploy everything works, as seen here running docker service ls:

result of docker service ls. all services up

When this deployment works, the services generally go live in this order:

  1. Registry (a private registry)
  2. Main (an Nginx service) and Postgres
  3. All other services in random order (all Node apps)

The problem I am having is on reboot. When I reboot the server, I pretty consistently have the issue of the services failing with this result:

after server reboot

I am getting some errors that could be helpful.

In Postgres: docker service logs APP_NAME_postgres -f:

docker service logs APP_NAME_postgres -f after reboot

In Docker logs: sudo journalctl -fu docker.service

sudo journalctl -fu docker.service after reboot

Update (June 5th, 2019): By request from a GitHub issue, here is the docker version output:

Client:
 Version:           18.09.5
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        e8ff056
 Built:             Thu Apr 11 04:43:57 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.5
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       e8ff056
  Built:            Thu Apr 11 04:10:53 2019
  OS/Arch:          linux/amd64
  Experimental:     false

And docker info output:

Containers: 28
 Running: 9
 Paused: 0
 Stopped: 19
Images: 14
Server Version: 18.09.5
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active
 NodeID: pbouae9n1qnezcq2y09m7yn43
 Is Manager: true
 ClusterID: nq9095ldyeq5ydbsqvwpgdw1z
 Managers: 1
 Nodes: 1
 Default Address Pool: 10.0.0.0/8  
 SubnetSize: 24
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 10
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 1
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 192.168.0.47
 Manager Addresses:
  192.168.0.47:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: bb71b10fd8f58240ca47fbb579b9d1028eea7c84
runc version: 2b18fe1d885ee5083ef9f0838fee39b62d653e30
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.15.0-50-generic
Operating System: Ubuntu 18.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 3.68GiB
Name: oeemaster
ID: 76LH:BH65:CFLT:FJOZ:NCZT:VJBM:2T57:UMAL:3PVC:OOXO:EBSZ:OIVH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine

WARNING: No swap limit support

And finally, My docker swarm stack/compose file:

secrets:
  jwt-secret:
    external: true
  pg-db:
    external: true
  pg-host:
    external: true
  pg-pass:
    external: true
  pg-user:
    external: true
  ssl_dhparam:
    external: true
services:
  accounts:
    depends_on:
    - postgres
    - registry
    deploy:
      restart_policy:
        condition: on-failure
    environment:
      JWT_SECRET_FILE: /run/secrets/jwt-secret
      PG_DB_FILE: /run/secrets/pg-db
      PG_HOST_FILE: /run/secrets/pg-host
      PG_PASS_FILE: /run/secrets/pg-pass
      PG_USER_FILE: /run/secrets/pg-user
    image: 127.0.0.1:5000/local-oee-master-accounts:v0.8.0
    secrets:
    - source: jwt-secret
    - source: pg-db
    - source: pg-host
    - source: pg-pass
    - source: pg-user
  graphs:
    depends_on:
    - postgres
    - registry
    deploy:
      restart_policy:
        condition: on-failure
    environment:
      PG_DB_FILE: /run/secrets/pg-db
      PG_HOST_FILE: /run/secrets/pg-host
      PG_PASS_FILE: /run/secrets/pg-pass
      PG_USER_FILE: /run/secrets/pg-user
    image: 127.0.0.1:5000/local-oee-master-graphs:v0.8.0
    secrets:
    - source: pg-db
    - source: pg-host
    - source: pg-pass
    - source: pg-user
  health:
    depends_on:
    - postgres
    - registry
    deploy:
      restart_policy:
        condition: on-failure
    environment:
      PG_DB_FILE: /run/secrets/pg-db
      PG_HOST_FILE: /run/secrets/pg-host
      PG_PASS_FILE: /run/secrets/pg-pass
      PG_USER_FILE: /run/secrets/pg-user
    image: 127.0.0.1:5000/local-oee-master-health:v0.8.0
    secrets:
    - source: pg-db
    - source: pg-host
    - source: pg-pass
    - source: pg-user
  live-data:
    depends_on:
    - postgres
    - registry
    deploy:
      restart_policy:
        condition: on-failure
    image: 127.0.0.1:5000/local-oee-master-live-data:v0.8.0
    ports:
    - published: 32000
      target: 80
  main:
    depends_on:
    - accounts
    - graphs
    - health
    - live-data
    - point-logs
    - registry
    deploy:
      restart_policy:
        condition: on-failure
    environment:
      MAIN_CONFIG_FILE: nginx.local.conf
    image: 127.0.0.1:5000/local-oee-master-nginx:v0.8.0
    ports:
    - published: 80
      target: 80
    - published: 443
      target: 443
  modbus-logger:
    depends_on:
    - point-logs
    - registry
    deploy:
      restart_policy:
        condition: on-failure
    environment:
      CONTROLLER_ADDRESS: 192.168.2.100
      SERVER_ADDRESS: http://point-logs
    image: 127.0.0.1:5000/local-oee-master-modbus-logger:v0.8.0
  point-logs:
    depends_on:
    - postgres
    - registry
    deploy:
      restart_policy:
        condition: on-failure
    environment:
      ENV_TYPE: local
      PG_DB_FILE: /run/secrets/pg-db
      PG_HOST_FILE: /run/secrets/pg-host
      PG_PASS_FILE: /run/secrets/pg-pass
      PG_USER_FILE: /run/secrets/pg-user
    image: 127.0.0.1:5000/local-oee-master-point-logs:v0.8.0
    secrets:
    - source: pg-db
    - source: pg-host
    - source: pg-pass
    - source: pg-user
  postgres:
    depends_on:
    - registry
    deploy:
      restart_policy:
        condition: on-failure
        window: 120s
    environment:
      POSTGRES_PASSWORD: password
    image: 127.0.0.1:5000/local-oee-master-postgres:v0.8.0
    ports:
    - published: 5432
      target: 5432
    volumes:
    - /media/db_main/postgres_oee_master:/var/lib/postgresql/data:rw
  registry:
    deploy:
      restart_policy:
        condition: on-failure
    image: registry:2
    ports:
    - mode: host
      published: 5000
      target: 5000
    volumes:
    - /mnt/registry:/var/lib/registry:rw
version: '3.2'

Things I've tried

Current Workaround

Currently, I have hacked together a solution that works just so I can move on to the next thing. I wrote a shell script called recreate.sh that kills the failed first-boot version of the stack, waits for it to fully tear down, and then "manually" runs docker stack deploy again. I set the script to run at boot with crontab @reboot. This works for shutdowns and reboots, but I don't consider it the proper answer, so I won't add it as one.
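For reference, the workaround described above could be sketched roughly like this. The stack name and compose file path are assumptions; substitute your own:

```shell
#!/bin/sh
# recreate.sh -- sketch of the reboot workaround, run via crontab @reboot.
# STACK and COMPOSE_FILE are placeholders, not values from the question.
STACK=APP_NAME
COMPOSE_FILE=/home/user/app/docker-compose.yml

# Remove the failed first-boot deployment.
docker stack rm "$STACK"

# Wait until the stack's services are actually gone
# (docker stack ps exits non-zero once nothing is found in the stack).
while docker stack ps "$STACK" >/dev/null 2>&1; do
    sleep 2
done
sleep 10   # give overlay networks a moment to be cleaned up

# Redeploy.
docker stack deploy -c "$COMPOSE_FILE" "$STACK"
```

The extra sleep after teardown matters in practice: docker stack rm returns before overlay networks are removed, and an immediate redeploy can fail with network-in-use errors.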

Upvotes: 5

Views: 747

Answers (1)

Miq

Reputation: 4279

It looks to me like you need to check who/what is killing the postgres service. From the logs you posted, it seems that postgres receives a smart shutdown signal and then stops gracefully. Your stack file has the restart policy set to "on-failure", and since the postgres process stops gracefully (exit code 0), Docker does not consider this a failure and, as instructed, does not restart it.

In conclusion, I'd recommend changing the restart policy from "on-failure" to "any".
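Applied to the postgres service from the question's stack file, that would be a one-line change (everything else stays as it is):

```yaml
  postgres:
    deploy:
      restart_policy:
        condition: any    # was: on-failure; "any" also restarts after a clean exit (code 0)
        window: 120s
```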

Also, keep in mind that the "depends_on" settings you use are ignored in swarm mode, so your services/images need their own way of ensuring proper startup order, or they must be able to cope when dependent services are not up yet.
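One common way to do that is an entrypoint wrapper in each Node image that blocks until Postgres accepts connections. This is a sketch, not the app's actual entrypoint; the host/port match the stack above, and "node server.js" is an assumed start command:

```shell
#!/bin/sh
# entrypoint.sh -- wait for postgres before starting the app.
# "postgres" resolves via the stack's overlay network.
until nc -z postgres 5432; do
    echo "waiting for postgres..."
    sleep 2
done
exec node server.js "$@"
```

A retry loop inside the app's database client (reconnect with backoff) achieves the same thing without changing the image's entrypoint.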

There's one more thing you could try - healthchecks. Perhaps your postgres base image has a healthcheck defined, and it is terminating the container by sending a kill signal to it. As written earlier, postgres then shuts down gracefully, there's no error exit code, and the restart policy does not trigger. Try disabling the healthcheck in your yaml, or look in the Dockerfiles for a HEALTHCHECK directive and figure out why it triggers.
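To check whether the image actually bakes in a healthcheck, you can inspect its config (image name taken from the question; "null" output means no healthcheck is defined):

```shell
# Print any HEALTHCHECK defined in the postgres image's config.
docker image inspect --format '{{json .Config.Healthcheck}}' \
    127.0.0.1:5000/local-oee-master-postgres:v0.8.0
```

If one is defined, compose file format 3.x (the question uses 3.2) lets you turn it off per service with `healthcheck: disable: true` under the postgres service.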

Upvotes: 2
