[HTCONDOR][kubernetes / k8s] : Unable to start minicondor image within k8s - condor_master not working

Question

POST EDIT

The issue is caused by the PSP (Pod Security Policy): by default, privilege escalation is not permitted for my condor user. supervisord runs as root and tries to write its logs and start the condor collector as root instead of as another user (i.e. condor), which is why it does not work.

Description

The minicondor base image does not start as expected in a Kubernetes pod on Rancher.

I am using this image: https://hub.docker.com/r/htcondor/mini in a custom namespace in Rancher (k8s).

PS: the image worked perfectly on:

a local env

minikube default installation

I am running it as a simple deployment :

When the pod starts, the default Kubernetes container log shows:

2021-09-15 09:26:36,908 INFO supervisord started with pid 1
2021-09-15 09:26:37,911 INFO spawned: 'condor_master' with pid 20
2021-09-15 09:26:37,912 INFO spawned: 'condor_restd' with pid 21
2021-09-15 09:26:37,917 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:37,924 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:38,926 INFO spawned: 'condor_master' with pid 22
2021-09-15 09:26:38,928 INFO spawned: 'condor_restd' with pid 23
2021-09-15 09:26:38,932 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:38,936 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:40,939 INFO spawned: 'condor_master' with pid 24
2021-09-15 09:26:40,943 INFO spawned: 'condor_restd' with pid 25
2021-09-15 09:26:40,947 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:40,948 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:43,953 INFO spawned: 'condor_master' with pid 26
2021-09-15 09:26:43,955 INFO spawned: 'condor_restd' with pid 27
2021-09-15 09:26:43,959 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:43,968 INFO gave up: condor_restd entered FATAL state, too many start retries too quickly
2021-09-15 09:26:43,969 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:44,970 INFO gave up: condor_master entered FATAL state, too many start retries too quickly
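To see at a glance which program fails and with which status, a log like the one above can be summarized with a few lines of Python. This is only a sketch: the regular expression is written against the supervisord lines shown in this post, and `summarize_exits` is a hypothetical helper name. By convention, exit status 127 means the command could not be executed at all, so `condor_restd` probably never started; the meaning of status 4 for `condor_master` is not assumed here.

```python
import re
from collections import Counter

# Matches supervisord lines such as:
#   2021-09-15 09:26:37,917 INFO exited: condor_restd (exit status 127; not expected)
EXIT_RE = re.compile(r"INFO exited: (\S+) \(exit status (\d+); not expected\)")


def summarize_exits(log_text):
    """Return {(program, exit_status): count} for unexpected exits in the log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EXIT_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return dict(counts)


# A few lines copied from the log in this post.
SAMPLE = """\
2021-09-15 09:26:37,917 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:37,924 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:38,932 INFO exited: condor_restd (exit status 127; not expected)
"""

if __name__ == "__main__":
    for (program, status), count in sorted(summarize_exits(SAMPLE).items()):
        print(f"{program}: exit status {status}, seen {count} time(s)")
```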

Here are the commands I ran and their output:

Command	Output
`condor_status`	`CEDAR:6001:Failed to connect to <127.0.0.1:9618>`
`condor_master`	`ERROR "Cannot open log file '/var/log/condor/MasterLog'" at line 174 in file /var/lib/condor/execute/slot1/dir_17406/userdir/.tmpruBd6F/BUILD/condor-9.0.5/src/condor_utils/dprintf_setup.cpp`

1) First attempt to fix the issue

I customized the image to fix the permissions, but the error is the same.

Dockerfile used for this attempt:

FROM htcondor/mini:9.2-el7

RUN condor_master

RUN chown condor:root /var/
RUN chown condor:root /var/log
RUN chown -R condor:root /var/log/
RUN chown -R condor:condor /var/log/condor

RUN chown condor:condor /var/log/condor/ProcLog
RUN chown condor:condor /var/log/condor/MasterLog

RUN chmod 775 -R /var/

Kubernetes (Rancher) YAML file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: htcondor-mini--all-in-one
  namespace: grafana-exporter
    spec:
      containers:
      - image: 
        imagePullPolicy: Always
        name: htcondor-mini--all-in-one
        resources: {}
        securityContext:
          capabilities: {}
        stdin: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        tty: true
      dnsConfig: {}
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30

Here are the commands and their output with the customized image:

Command	Output
`condor_status`	`CEDAR:6001:Failed to connect to <127.0.0.1:9618>`
`condor_master`	`ERROR "Cannot open log file '/var/log/condor/MasterLog'" at line 174 in file /var/lib/condor/execute/slot1/dir_17406/userdir/.tmpruBd6F/BUILD/condor-9.0.5/src/condor_utils/dprintf_setup.cpp`
`ls -ld /var/`	`drwxrwxr-x 1 condor root 17 Nov 13 2020 /var/`
`ls -ld /var/log/`	`drwxrwxr-x 1 condor root 65 Oct 7 11:54 /var/log/`
`ls -ld /var/log/condor`	`drwxrwxr-x 1 condor condor 240 Oct 7 11:23 /var/log/condor`
`ls -ld /var/log/condor/MasterLog`	`-rwxrwxr-x 1 condor condor 3243 Oct 7 11:23 /var/log/condor/MasterLog`
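The ownership shown by `ls` only helps if the process in the pod really runs as the user you expect. A small check run inside the container can confirm which UID is in use and whether that UID can write to the log location. This is a minimal sketch; `describe_access` is a hypothetical helper, and the MasterLog path is simply the one from the error message above.

```python
import os


def describe_access(path):
    """Report whether the current real UID can write to `path`.

    If `path` does not exist yet, its parent directory is checked instead,
    because that is where the file would have to be created.
    """
    target = path if os.path.exists(path) else (os.path.dirname(path) or ".")
    return {
        "path": path,
        "checked": target,
        "uid": os.getuid(),
        "writable": os.access(target, os.W_OK),
    }


if __name__ == "__main__":
    # Path taken from the condor_master error message in this post.
    print(describe_access("/var/log/condor/MasterLog"))
```

If `writable` is false while `ls -l` shows the file as owned by condor, the pod is most likely running under a different UID than the one the image was prepared for.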

MasterLog content :

10/07/21 11:23:21 ******************************************************
10/07/21 11:23:21 ** condor_master (CONDOR_MASTER) STARTING UP
10/07/21 11:23:21 ** /usr/sbin/condor_master
10/07/21 11:23:21 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
10/07/21 11:23:21 ** Configuration: subsystem:MASTER local: class:DAEMON
10/07/21 11:23:21 ** $CondorVersion: 9.2.0 Sep 23 2021 BuildID: 557262 PackageID: 9.2.0-1 $
10/07/21 11:23:21 ** $CondorPlatform: x86_64_CentOS7 $
10/07/21 11:23:21 ** PID = 7
10/07/21 11:23:21 ** Log last touched time unavailable (No such file or directory)
10/07/21 11:23:21 ******************************************************
10/07/21 11:23:21 Using config source: /etc/condor/condor_config
10/07/21 11:23:21 Using local config sources: 
10/07/21 11:23:21    /etc/condor/config.d/00-htcondor-9.0.config
10/07/21 11:23:21    /etc/condor/config.d/00-minicondor
10/07/21 11:23:21    /etc/condor/config.d/01-misc.conf
10/07/21 11:23:21    /etc/condor/condor_config.local
10/07/21 11:23:21 config Macros = 73, Sorted = 73, StringBytes = 1848, TablesBytes = 2692
10/07/21 11:23:21 CLASSAD_CACHING is OFF
10/07/21 11:23:21 Daemon Log is logging: D_ALWAYS D_ERROR
10/07/21 11:23:21 SharedPortEndpoint: waiting for connections to named socket master_7_43af
10/07/21 11:23:21 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory
10/07/21 11:23:21 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
10/07/21 11:23:21 Permission denied error during DISCARD_SESSION_KEYRING_ON_STARTUP, continuing anyway
10/07/21 11:23:21 Adding SHARED_PORT to DAEMON_LIST, because USE_SHARED_PORT=true (to disable this, set AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)
10/07/21 11:23:21 SHARED_PORT is in front of a COLLECTOR, so it will use the configured collector port
10/07/21 11:23:21 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1632433213)
10/07/21 11:23:21 Cannot remove wait-for-startup file /var/lock/condor/shared_port_ad
10/07/21 11:23:21 WARNING: forward resolution of ip6-localhost doesn't match 127.0.0.1!
10/07/21 11:23:21 WARNING: forward resolution of ip6-loopback doesn't match 127.0.0.1!
10/07/21 11:23:22 Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 9
10/07/21 11:23:22 Waiting for /var/lock/condor/shared_port_ad to appear.
10/07/21 11:23:22 Found /var/lock/condor/shared_port_ad.
10/07/21 11:23:22 Cannot remove wait-for-startup file /var/log/condor/.collector_address
10/07/21 11:23:23 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 10
10/07/21 11:23:23 Waiting for /var/log/condor/.collector_address to appear.
10/07/21 11:23:23 Found /var/log/condor/.collector_address.
10/07/21 11:23:23 Started DaemonCore process "/usr/sbin/condor_negotiator", pid and pgroup = 11
10/07/21 11:23:23 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 12
10/07/21 11:23:24 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 15
10/07/21 11:23:24 Daemons::StartAllDaemons all daemons were started

A huge thanks for reading. Hope it will help many other people.

blackbird · Accepted Answer

Cause of the issue

The issue is caused by the PSP (Pod Security Policy): by default, privilege escalation is not permitted for my condor user.
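One way to see which restrictions the cluster actually applies is to read `/proc/self/status` from inside the container. The sketch below assumes a Linux kernel recent enough to expose the `NoNewPrivs` field, and that the two capabilities of interest are `CAP_SETGID` (bit 6) and `CAP_SETUID` (bit 7), which a root process needs in order to switch to another user the way supervisord does for `user=condor`. As far as I know, `allowPrivilegeEscalation: false` shows up as `NoNewPrivs: 1`. `parse_status` is a hypothetical helper name.

```python
CAP_SETGID = 6  # bit positions as defined in linux/capability.h
CAP_SETUID = 7


def parse_status(text):
    """Extract the real UID, NoNewPrivs and setuid/setgid capabilities
    from the content of /proc/<pid>/status."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    cap_eff = int(fields.get("CapEff", "0"), 16)
    return {
        "real_uid": int(fields["Uid"].split()[0]),
        "no_new_privs": fields.get("NoNewPrivs") == "1",
        "can_setuid": bool((cap_eff >> CAP_SETUID) & 1),
        "can_setgid": bool((cap_eff >> CAP_SETGID) & 1),
    }


# Illustrative sample, not taken from the cluster in this post.
SAMPLE = "Name:\tsupervisord\nUid:\t64\t64\t64\t64\nCapEff:\t0000000000000000\nNoNewPrivs:\t1\n"

if __name__ == "__main__":
    try:
        with open("/proc/self/status") as handle:
            print(parse_status(handle.read()))
    except OSError:
        # Not on Linux, or /proc is not mounted: fall back to the sample.
        print(parse_status(SAMPLE))
```

If `real_uid` is not 0, or `can_setuid` is false, supervisord cannot start a program under a different user, which matches the behaviour described in this post.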

SOLUTION

The best solution I have found so far is to run everything as the condor user and give that user the necessary permissions. To do so:

In supervisord.conf: run supervisord as the condor user
In supervisord.conf: put the log, pid, and socket files in /tmp
In the Dockerfile: change the owner of the relevant folders to condor
In deployment.yaml: set the user ID to 64 (the condor user)

Dockerfile

FROM htcondor/mini:9.2-el7

# SET WORKDIR
WORKDIR /home/condor/
RUN chown condor:condor /home/condor

# COPY SUPERVISOR
COPY supervisord.conf /etc/supervisord.conf

# Run condor_master once at build time so that it creates its directories and log files
RUN condor_master

# FIX PERMISSION ISSUES FOR RANCHER
RUN chown -R condor:condor /var/log/ /tmp &&\
 chown -R restd:restd /home/restd &&\
 chmod 755 -R /home/restd

supervisord.conf:

[supervisord]
user=condor
nodaemon=true
logfile = /tmp/supervisord.log
directory = /tmp
pidfile = /tmp/supervisord.pid
childlogdir = /tmp

# The next three sections are needed to manage the daemons with supervisorctl
[unix_http_server]
file=/tmp/supervisord.sock
chown=condor:condor
chmod=0777
user=condor

[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface

[supervisorctl]
serverurl=unix:///tmp/supervisord.sock

[program:condor_master]
user=condor
command=/usr/sbin/condor_master -f
autostart=true
autorestart=true
redirect_stderr=true
stdout_logfile = /var/log/condor_master.log
stderr_logfile = /var/log/condor_master.error.log
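For supervisorctl to reach supervisord, the socket path in `[unix_http_server]` (`file=`) and the one in `[supervisorctl]` (`serverurl=`) must be identical. A typo there is easy to miss, so a quick check with Python's standard `configparser` can help. This is a sketch under the assumption that the file parses as plain INI, which holds for the configuration in this post; `socket_mismatch` is a hypothetical helper and the sample paths are only illustrative.

```python
import configparser


def socket_mismatch(conf_text):
    """Return (server_socket, client_socket) if the two paths differ, else None."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    server = parser.get("unix_http_server", "file", fallback=None)
    url = parser.get("supervisorctl", "serverurl", fallback=None)
    if server is None or url is None or not url.startswith("unix://"):
        return None  # nothing to compare
    client = url[len("unix://"):]
    return None if client == server else (server, client)


# Illustrative config with a one-letter difference between the two paths.
BROKEN = """\
[unix_http_server]
file=/tmp/supervisord.sock

[supervisorctl]
serverurl=unix:///tmp/supervisor.sock
"""

if __name__ == "__main__":
    print(socket_mismatch(BROKEN))
```

To check the real file, read it from `/etc/supervisord.conf` (the destination used in the Dockerfile) and pass its content to the function.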

deployment.yaml

apiVersion: apps/v1
kind: Deployment
spec:
      containers:
      - image: 
        imagePullPolicy: Always
        name: htcondor-exporter
        ports:
        - containerPort: 8080
          name: myport
          protocol: TCP
        resources: {}
        securityContext:
          capabilities: {}
          runAsNonRoot: false
          runAsUser: 64
        stdin: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        tty: true