Bruno Unna

Reputation: 1454

How to run a Cassandra Docker container from Nomad?

I want to run a Cassandra container from a Nomad job. It seems to start, but after a few seconds it dies (it seems to be killed by Nomad itself).

If I run the container from the command line, with:

docker run --name some-cassandra -p 9042:9042 -d cassandra:3.0

the container starts flawlessly. But if I create a Nomad job like this:

job "cassandra" {

  datacenters = ["dc1"]

  type = "service"

  update {
    max_parallel = 1
    min_healthy_time = "10s"
    healthy_deadline = "5m"
    progress_deadline = "10m"
    auto_revert = false
    canary = 0
  }

  migrate {
    max_parallel = 1
    health_check = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "5m"
  }

  group "cassandra" {
    restart {
      attempts = 2
      interval = "240s"
      delay = "120s"
      mode = "delay"
    }

    task "cassandra" {
      driver = "docker"

      config {
        image = "cassandra:3.0"
        network_mode = "bridge"
        port_map {
          cql = 9042
        }
      }

      resources {
        memory = 2048
        cpu = 800
        network {
          port "cql" {}
        }
      }

      env {
        CASSANDRA_LISTEN_ADDRESS = "${NOMAD_IP_cql}"
      }

      service {
        name = "cassandra"
        tags = ["global", "cassandra"]
        port = "cql"
      }
    }
  }
}

Then it will never start. Nomad's web interface shows nothing in the stdout logs of the created allocation, and the stderr stream only shows Killed.
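The same information should be retrievable from the CLI (here <alloc_id> stands for the allocation id that Nomad reports):

nomad job status cassandra                # lists the job's allocations
nomad alloc status <alloc_id>             # per-allocation events, including the kill
nomad logs <alloc_id> cassandra           # the task's stdout (empty here)
nomad logs -stderr <alloc_id> cassandra   # the task's stderr ("Killed")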

I know that while this is happening, Docker containers are created and then removed after a few seconds. I cannot read the logs of these containers, for when I try (with docker logs <container_id>), all I get is:

Error response from daemon: configured logging driver does not support reading
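At least the logging driver the container was given can be confirmed with docker inspect (which works where docker logs does not), if you manage to catch the container before it is removed:

docker inspect --format '{{.HostConfig.LogConfig.Type}}' <container_id>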

And the allocation overview shows this message:

12/06/18 14:16:04   Terminated  Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
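Exit code 137 is 128 + 9, i.e. the process was killed with SIGKILL, which frequently points at the kernel OOM killer or a cgroup memory limit. Since the container disappears too quickly to inspect, the kernel log is one place to check (a generic check, not specific to Nomad):

dmesg | grep -i -E 'killed process|out of memory'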

According to Docker:

If there is no database initialized when the container starts, then a default database will be created. While this is the expected behavior, this means that it will not accept incoming connections until such initialization completes. This may cause issues when using automation tools, such as docker-compose, which start several containers simultaneously.

But I doubt this is the source of my problem, because I've increased the restart stanza values with no effect, and because the task fails after just a few seconds.

Not long ago I experienced a somewhat similar problem with a Kafka container that, as it turned out, was not happy because it wanted more memory. But in this case I've provided generous values for memory and CPU in the resources stanza, and it doesn't seem to make any difference.

My host OS is Arch Linux, with kernel 4.19.4-arch1-1-ARCH. I'm running Consul as a systemd service, and the Nomad agent with this command line:

sudo nomad agent -dev

What can I possibly be missing? Any help and/or pointers will be appreciated.

Update (2018-12-06 16:26 GMT): by reading the output of the Nomad agent in detail, I found that some valuable information can be read in the host's /tmp directory. A snippet of that output:

    2018/12/06 16:03:03 [DEBUG] memberlist: TCP connection from=127.0.0.1:45792
    2018/12/06 16:03:03.180586 [DEBUG] driver.docker: docker pull cassandra:latest succeeded
2018-12-06T16:03:03.184Z [DEBUG] plugin: starting plugin: path=/usr/bin/nomad args="[/usr/bin/nomad executor {"LogFile":"/tmp/NomadClient073551030/1c315bf2-688c-2c7b-8d6f-f71fec1254f3/cassandra/executor.out","LogLevel":"DEBUG"}]"
2018-12-06T16:03:03.185Z [DEBUG] plugin: waiting for RPC address: path=/usr/bin/nomad
2018-12-06T16:03:03.235Z [DEBUG] plugin.nomad: plugin address: timestamp=2018-12-06T16:03:03.235Z address=/tmp/plugin681788273 network=unix
    2018/12/06 16:03:03.253166 [DEBUG] driver.docker: Setting default logging options to syslog and unix:///tmp/plugin559865372
    2018/12/06 16:03:03.253196 [DEBUG] driver.docker: Using config for logging: {Type:syslog ConfigRaw:[] Config:map[syslog-address:unix:///tmp/plugin559865372]}
    2018/12/06 16:03:03.253206 [DEBUG] driver.docker: using 2147483648 bytes memory for cassandra
    2018/12/06 16:03:03.253217 [DEBUG] driver.docker: using 800 cpu shares for cassandra
    2018/12/06 16:03:03.253237 [DEBUG] driver.docker: binding directories []string{"/tmp/NomadClient073551030/1c315bf2-688c-2c7b-8d6f-f71fec1254f3/alloc:/alloc", "/tmp/NomadClient073551030/1c315bf2-688c-2c7b-8d6f-f71fec1254f3/cassandra/local:/local", "/tmp/NomadClient073551030/1c315bf2-688c-2c7b-8d6f-f71fec1254f3/cassandra/secrets:/secrets"} for cassandra
    2018/12/06 16:03:03.253282 [DEBUG] driver.docker: allocated port 127.0.0.1:29073 -> 9042 (mapped)
    2018/12/06 16:03:03.253296 [DEBUG] driver.docker: exposed port 9042
    2018/12/06 16:03:03.253320 [DEBUG] driver.docker: setting container name to: cassandra-1c315bf2-688c-2c7b-8d6f-f71fec1254f3
    2018/12/06 16:03:03.361162 [INFO] driver.docker: created container 29b0764bd2de69bda6450ebb1a55ffd2cbb4dc3002f961cb5db71b323d611199
    2018/12/06 16:03:03.754476 [INFO] driver.docker: started container 29b0764bd2de69bda6450ebb1a55ffd2cbb4dc3002f961cb5db71b323d611199
    2018/12/06 16:03:03.757642 [DEBUG] consul.sync: registered 1 services, 0 checks; deregistered 0 services, 0 checks
    2018/12/06 16:03:03.765001 [DEBUG] client: error fetching stats of task cassandra: stats collection hasn't started yet
    2018/12/06 16:03:03.894514 [DEBUG] client: updated allocations at index 371 (total 2) (pulled 0) (filtered 2)
    2018/12/06 16:03:03.894584 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 2)
    2018/12/06 16:03:05.190647 [DEBUG] driver.docker: error collecting stats from container 29b0764bd2de69bda6450ebb1a55ffd2cbb4dc3002f961cb5db71b323d611199: io: read/write on closed pipe
2018-12-06T16:03:09.191Z [DEBUG] plugin.nomad: 2018/12/06 16:03:09 [ERR] plugin: plugin server: accept unix /tmp/plugin681788273: use of closed network connection
2018-12-06T16:03:09.194Z [DEBUG] plugin: plugin process exited: path=/usr/bin/nomad
    2018/12/06 16:03:09.223734 [INFO] client: task "cassandra" for alloc "1c315bf2-688c-2c7b-8d6f-f71fec1254f3" failed: Wait returned exit code 137, signal 0, and error Docker container exited with non-zero exit code: 137
    2018/12/06 16:03:09.223802 [INFO] client: Restarting task "cassandra" for alloc "1c315bf2-688c-2c7b-8d6f-f71fec1254f3" in 2m7.683274502s
    2018/12/06 16:03:09.230053 [DEBUG] consul.sync: registered 0 services, 0 checks; deregistered 1 services, 0 checks
    2018/12/06 16:03:09.233507 [DEBUG] consul.sync: registered 0 services, 0 checks; deregistered 0 services, 0 checks
    2018/12/06 16:03:09.296185 [DEBUG] client: updated allocations at index 372 (total 2) (pulled 0) (filtered 2)
    2018/12/06 16:03:09.296313 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 2)
    2018/12/06 16:03:11.541901 [DEBUG] http: Request GET /v1/agent/health?type=client (452.678µs)

But the contents of /tmp/NomadClient.../<alloc_id>/... are deceptively simple:

[root@singularity 1c315bf2-688c-2c7b-8d6f-f71fec1254f3]# ls -lR
.:
total 0
drwxrwxrwx 5 nobody nobody 100 Dec  6 15:52 alloc
drwxrwxrwx 5 nobody nobody 120 Dec  6 15:53 cassandra

./alloc:
total 0
drwxrwxrwx 2 nobody nobody 40 Dec  6 15:52 data
drwxrwxrwx 2 nobody nobody 80 Dec  6 15:53 logs
drwxrwxrwx 2 nobody nobody 40 Dec  6 15:52 tmp

./alloc/data:
total 0

./alloc/logs:
total 0
-rw-r--r-- 1 root root 0 Dec  6 15:53 cassandra.stderr.0
-rw-r--r-- 1 root root 0 Dec  6 15:53 cassandra.stdout.0

./alloc/tmp:
total 0

./cassandra:
total 4
-rw-r--r-- 1 root   root   1248 Dec  6 16:19 executor.out
drwxrwxrwx 2 nobody nobody   40 Dec  6 15:52 local
drwxrwxrwx 2 nobody nobody   60 Dec  6 15:52 secrets
drwxrwxrwt 2 nobody nobody   40 Dec  6 15:52 tmp

./cassandra/local:
total 0

./cassandra/secrets:
total 0

./cassandra/tmp:
total 0

Both cassandra.stdout.0 and cassandra.stderr.0 are empty, and the full contents of the executor.out file are:

2018/12/06 15:53:22.822072 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin278120866
2018/12/06 15:55:53.009611 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin242312234
2018/12/06 15:58:29.135309 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin226242288
2018/12/06 16:00:53.942271 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin373025133
2018/12/06 16:03:03.252389 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin559865372
2018/12/06 16:05:19.656317 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin090082811
2018/12/06 16:07:28.468809 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin383954837
2018/12/06 16:09:54.068604 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin412544225
2018/12/06 16:12:10.085157 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin279043152
2018/12/06 16:14:48.255653 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin209533710
2018/12/06 16:17:23.735550 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin168184243
2018/12/06 16:19:40.232181 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin839254781
2018/12/06 16:22:13.485457 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin406142133
2018/12/06 16:24:24.869274 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin964077792

Update (2018-12-06 16:40 GMT): since it's apparent that the agent wants to log to syslog, I've set up and launched a local syslog server, to no avail: the syslog server receives no messages whatsoever.

Upvotes: 3

Views: 1590

Answers (1)

Bruno Unna

Reputation: 1454

Problem solved. Its nature is twofold:

  • Nomad's Docker driver encapsulates the behaviour of the containers (very efficiently), which at times makes them very silent.

  • Cassandra is very demanding of resources, much more than I originally thought. I was convinced that 4 GB of RAM would be enough for it to run comfortably, but as it turns out it needs (at least in my environment) 6 GB; see the adjusted stanza below.
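For reference, the only change that was needed in the job file above is the memory value in the resources stanza (6144 MB is what worked in my environment; your mileage may vary):

      resources {
        memory = 6144   # 6 GB; the original 2048 got the container OOM-killed
        cpu = 800
        network {
          port "cql" {}
        }
      }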

Disclaimer: I'm actually using bitnami/cassandra now instead of cassandra, because I believe their images are of very high quality, secure, and configurable by means of environment variables. I made this discovery using Bitnami's image, and I haven't tested how the original one reacts to being given this amount of memory.

As to why it doesn't fail when running the container directly from Docker's CLI, I think that's because no limits are specified when running it that way. Docker simply takes as much memory as it needs for its containers, so if the host's memory eventually becomes insufficient for all of them, the realisation comes much later (and possibly more painfully). So this early failure should be a welcome benefit of an orchestration platform such as Nomad. If I have any complaint, it is that finding out what the problem was took so long because of the lack of visibility into the container!
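If you want to confirm the diagnosis outside Nomad, imposing an equivalent limit with Docker's own --memory flag should reproduce the failure (a sketch; the 2g value mirrors the original memory = 2048 setting):

docker run --name some-cassandra -m 2g -p 9042:9042 -d cassandra:3.0

# once the container dies, Docker records whether the OOM killer was involved
docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' some-cassandra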

Upvotes: 3
