blackbird
blackbird

Reputation: 146

[HTCONDOR][kubernetes / k8s] : Unable to start minicondor image within k8s - condor_master not working

POST EDIT

The issue is due to :

PSP (Pod security policy) By default escalation is not permit for my condor user. That is why it is not working. because the supervisord is running as root user and try to write logs and start condor collector as root and not as an other user (i.e condor)

Description

The mini-condor base image is not starting as expected on kubernetes rancher pod.

I am using :

ps : the image was working perfectly on :

  • a local env
  • minikube default installation

I am running it as a simple deployment :

When the pod is starting, the Kubernetes default log file is displaying :

2021-09-15 09:26:36,908 INFO supervisord started with pid 1
2021-09-15 09:26:37,911 INFO spawned: 'condor_master' with pid 20
2021-09-15 09:26:37,912 INFO spawned: 'condor_restd' with pid 21
2021-09-15 09:26:37,917 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:37,924 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:38,926 INFO spawned: 'condor_master' with pid 22
2021-09-15 09:26:38,928 INFO spawned: 'condor_restd' with pid 23
2021-09-15 09:26:38,932 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:38,936 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:40,939 INFO spawned: 'condor_master' with pid 24
2021-09-15 09:26:40,943 INFO spawned: 'condor_restd' with pid 25
2021-09-15 09:26:40,947 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:40,948 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:43,953 INFO spawned: 'condor_master' with pid 26
2021-09-15 09:26:43,955 INFO spawned: 'condor_restd' with pid 27
2021-09-15 09:26:43,959 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:43,968 INFO gave up: condor_restd entered FATAL state, too many start retries too quickly
2021-09-15 09:26:43,969 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:44,970 INFO gave up: condor_master entered FATAL state, too many start retries too quickly

Here is a brief cmd and output result:

CMD output
condor_status CEDAR:6001:Failed to connect to <127.0.0.1:9618>
condor_master ERROR "Cannot open log file '/var/log/condor/MasterLog'" at line 174 in file /var/lib/condor/execute/slot1/dir_17406/userdir/.tmpruBd6F/BUILD/condor-9.0.5/src/condor_utils/dprintf_setup.cpp`

1)first try to fix the issue

I decided to customize the image, but the error is the same

The docker images used to try to fix the permission issue

FROM htcondor/mini:9.2-el7

RUN condor_master

RUN chown condor:root /var/
RUN chown condor:root /var/log
RUN chown -R condor:root /var/log/
RUN chown -R condor:condor /var/log/condor

RUN chown condor:condor /var/log/condor/ProcLog
RUN chown condor:condor /var/log/condor/MasterLog

RUN chmod 775 -R /var/
apiVersion: apps/v1
kind: Deployment
metadata:
  name: htcondor-mini--all-in-one
  namespace: grafana-exporter
    spec:
      containers:
      - image: <custom_image>
        imagePullPolicy: Always
        name: htcondor-mini--all-in-one
        resources: {}
        securityContext:
          capabilities: {}
        stdin: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        tty: true
      dnsConfig: {}
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30

Here is a brief cmd and output result:

CMD output
condor_status CEDAR:6001:Failed to connect to <127.0.0.1:9618>
condor_master ERROR "Cannot open log file '/var/log/condor/MasterLog'" at line 174 in file /var/lib/condor/execute/slot1/dir_17406/userdir/.tmpruBd6F/BUILD/condor-9.0.5/src/condor_utils/dprintf_setup.cpp`
ls -ld /var/ drwxrwxr-x 1 condor root 17 Nov 13 2020 /var/
ls -ld /var/log/ drwxrwxr-x 1 condor root 65 Oct 7 11:54 /var/log/
ls -ld /var/log/condor drwxrwxr-x 1 condor condor 240 Oct 7 11:23 /var/log/condor
ls -ld /var/log/condor/MasterLog -rwxrwxr-x 1 condor condor 3243 Oct 7 11:23 /var/log/condor/MasterLog

MasterLog content :

10/07/21 11:23:21 ******************************************************
10/07/21 11:23:21 ** condor_master (CONDOR_MASTER) STARTING UP
10/07/21 11:23:21 ** /usr/sbin/condor_master
10/07/21 11:23:21 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
10/07/21 11:23:21 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
10/07/21 11:23:21 ** $CondorVersion: 9.2.0 Sep 23 2021 BuildID: 557262 PackageID: 9.2.0-1 $
10/07/21 11:23:21 ** $CondorPlatform: x86_64_CentOS7 $
10/07/21 11:23:21 ** PID = 7
10/07/21 11:23:21 ** Log last touched time unavailable (No such file or directory)
10/07/21 11:23:21 ******************************************************
10/07/21 11:23:21 Using config source: /etc/condor/condor_config
10/07/21 11:23:21 Using local config sources: 
10/07/21 11:23:21    /etc/condor/config.d/00-htcondor-9.0.config
10/07/21 11:23:21    /etc/condor/config.d/00-minicondor
10/07/21 11:23:21    /etc/condor/config.d/01-misc.conf
10/07/21 11:23:21    /etc/condor/condor_config.local
10/07/21 11:23:21 config Macros = 73, Sorted = 73, StringBytes = 1848, TablesBytes = 2692
10/07/21 11:23:21 CLASSAD_CACHING is OFF
10/07/21 11:23:21 Daemon Log is logging: D_ALWAYS D_ERROR
10/07/21 11:23:21 SharedPortEndpoint: waiting for connections to named socket master_7_43af
10/07/21 11:23:21 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory
10/07/21 11:23:21 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
10/07/21 11:23:21 Permission denied error during DISCARD_SESSION_KEYRING_ON_STARTUP, continuing anyway
10/07/21 11:23:21 Adding SHARED_PORT to DAEMON_LIST, because USE_SHARED_PORT=true (to disable this, set AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)
10/07/21 11:23:21 SHARED_PORT is in front of a COLLECTOR, so it will use the configured collector port
10/07/21 11:23:21 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1632433213)
10/07/21 11:23:21 Cannot remove wait-for-startup file /var/lock/condor/shared_port_ad
10/07/21 11:23:21 WARNING: forward resolution of ip6-localhost doesn't match 127.0.0.1!
10/07/21 11:23:21 WARNING: forward resolution of ip6-loopback doesn't match 127.0.0.1!
10/07/21 11:23:22 Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 9
10/07/21 11:23:22 Waiting for /var/lock/condor/shared_port_ad to appear.
10/07/21 11:23:22 Found /var/lock/condor/shared_port_ad.
10/07/21 11:23:22 Cannot remove wait-for-startup file /var/log/condor/.collector_address
10/07/21 11:23:23 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 10
10/07/21 11:23:23 Waiting for /var/log/condor/.collector_address to appear.
10/07/21 11:23:23 Found /var/log/condor/.collector_address.
10/07/21 11:23:23 Started DaemonCore process "/usr/sbin/condor_negotiator", pid and pgroup = 11
10/07/21 11:23:23 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 12
10/07/21 11:23:24 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 15
10/07/21 11:23:24 Daemons::StartAllDaemons all daemons were started

A huge thanks for reading. Hope it will help many other people.

Upvotes: 4

Views: 363

Answers (1)

blackbird
blackbird

Reputation: 146

Cause of the issue

The issue is due to :

PSP policy (Pod security policy) By default escalation is not permit for my condor user.

SOLUTION

THE BEST SOLUTION I found at the moment is to run EVERYTHING as condor user and give the permisssion to the condor users. To do so you need :

  • In the supervisord.conf : Run supervisor as condor user
  • In the supervisord.conf : run log and socket in /tmp
  • In the Dockerfile : Change the owner of most of folder by condor
  • In the deployment.yamlset the ID to 64 (condor user)

Dockerfile

FROM htcondor/mini:9.2-el7

# SET WORKDIR
WORKDIR /home/condor/
RUN chown condor:condor /home/condor

# COPY SUPERVISOR
COPY supervisord.conf /etc/supervisord.conf

# Need to run the cmd to create all dir
RUN condor_master

# FIX PERMISSION ISSUES FOR RANCHER
RUN chown -R condor:condor /var/log/ /tmp &&\
 chown -R restd:restd /home/restd &&\
 chmod 755 -R /home/restd

supervisord.conf:

[supervisord]
user=condor
nodaemon=true
logfile = /tmp/supervisord.log
directory = /tmp
pidfile = /tmp/supervisord.pid
childlogdir = /tmp

# next 3 sections contain using supervisorctl to manage daemons
[unix_http_server]
file=/tmp/supervisord.sock
chown=condor:condor
chmod=0777
user=condor

[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface

[supervisorctl]
serverurl=unix:///tmp/supervisor.sock

[program:condor_master]
user=condor
command=/usr/sbin/condor_master -f
autostart=true
autorestart=true
redirect_stderr=true
stdout_logfile = /var/log/condor_master.log
stderr_logfile = /var/log/condor_master.error.log

deployment.yaml

apiVersion: apps/v1
kind: Deployment
spec:
      containers:
      - image: <condor-image>
        imagePullPolicy: Always
        name: htcondor-exporter
        ports:
        - containerPort: 8080
          name: myport
          protocol: TCP
        resources: {}
        securityContext:
          capabilities: {}
          runAsNonRoot: false
          runAsUser: 64
        stdin: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        tty: true

Upvotes: 0

Related Questions