Reputation: 79
I'm trying to deploy an HA Keycloak cluster (2 nodes) on Kubernetes (GKE). So far, judging from the logs, the cluster nodes (pods) fail to discover each other in every case: the pods start and the service is up, but each pod fails to see the other nodes.
Components
Logs Snippet:
INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-4) ISPN000078: Starting JGroups channel ejb
INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-4) ISPN000094: Received new cluster view for channel ejb: [keycloak-567575d6f8-c5s42|0] (1) [keycloak-567575d6f8-c5s42]
INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-1) ISPN000094: Received new cluster view for channel ejb: [keycloak-567575d6f8-c5s42|0] (1) [keycloak-567575d6f8-c5s42]
INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-3) ISPN000094: Received new cluster view for channel ejb: [keycloak-567575d6f8-c5s42|0] (1) [keycloak-567575d6f8-c5s42]
INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-4) ISPN000079: Channel ejb local address is keycloak-567575d6f8-c5s42, physical addresses are [127.0.0.1:55200]
.
.
.
INFO [org.jboss.as] (Controller Boot Thread) WFLYSRV0025: Keycloak 15.0.2 (WildFly Core 15.0.1.Final) started in 67547ms - Started 692 of 978 services (686 services are lazy, passive or on-demand)
INFO [org.jboss.as] (Controller Boot Thread) WFLYSRV0060: Http management interface listening on http://127.0.0.1:9990/management
INFO [org.jboss.as] (Controller Boot Thread) WFLYSRV0051: Admin console listening on http://127.0.0.1:9990
As we can see in the logs above, the node sees only itself (its own container/pod ID) in the cluster view.
I tried using the kubernetes.KUBE_PING protocol for discovery, but it didn't work: the call to the Kubernetes API (listing the pods) failed with a 403 Authorization error in the logs (part of it shown below):
Server returned HTTP response code: 403 for URL: https://[SERVER_IP]:443/api/v1/namespaces/default/pods
At this point I was able to log in to the portal and make changes, but it was not yet an HA cluster: changes were not replicated and the session was not preserved. In other words, if I deleted the pod I was using, I was redirected to the other pod with a new session (as if it were a separate node).
When I tried DNS_PING, things were different: I had no Kubernetes API authorization issues, but I was not able to log in.
In detail, I could reach the login page normally, but when I entered my credentials and tried to log in, the page started loading and then sent me back to the login page, with nothing relevant about it in the pod logs.
Below are the manifests I have been working with over the past couple of days:
Postgresql Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:13
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_PASSWORD
              value: "postgres"
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
---
apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  selector:
    app: postgres
  ports:
    - port: 5432
      targetPort: 5432
Keycloak HA cluster Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: keycloak
  labels:
    app: keycloak
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  selector:
    matchLabels:
      app: keycloak
  template:
    metadata:
      labels:
        app: keycloak
    spec:
      containers:
        - name: keycloak
          image: jboss/keycloak
          env:
            - name: KEYCLOAK_USER
              value: admin
            - name: KEYCLOAK_PASSWORD
              value: admin123
            - name: DB_VENDOR
              value: POSTGRES
            - name: DB_ADDR
              value: "postgres"
            - name: DB_PORT
              value: "5432"
            - name: DB_USER
              value: "postgres"
            - name: DB_PASSWORD
              value: "postgres"
            - name: DB_SCHEMA
              value: "public"
            - name: DB_DATABASE
              value: "keycloak"
            # - name: JGROUPS_DISCOVERY_PROTOCOL
            #   value: kubernetes.KUBE_PING
            # - name: JGROUPS_DISCOVERY_PROPERTIES
            #   value: dump_requests=true,port_range=0,namespace=default
            #   value: port_range=0,dump_requests=true
            - name: JGROUPS_DISCOVERY_PROTOCOL
              value: dns.DNS_PING
            - name: JGROUPS_DISCOVERY_PROPERTIES
              value: "dns_query=keycloak"
            - name: CACHE_OWNERS_COUNT
              value: '2'
            - name: CACHE_OWNERS_AUTH_SESSIONS_COUNT
              value: '2'
            - name: PROXY_ADDRESS_FORWARDING
              value: "true"
          ports:
            - name: http
              containerPort: 8080
            - name: https
              containerPort: 8443
---
apiVersion: v1
kind: Service
metadata:
  name: keycloak
  labels:
    app: keycloak
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 80
      targetPort: 8080
    - name: https
      port: 443
      targetPort: 8443
  selector:
    app: keycloak
---
apiVersion: v1
kind: Service
metadata:
  name: keycloak-np
  labels:
    app: keycloak
spec:
  type: LoadBalancer
  ports:
    - name: http
      port: 80
      targetPort: 8080
    - name: https
      port: 443
      targetPort: 8443
  selector:
    app: keycloak
Upvotes: 3
Views: 8898
Reputation: 1815
By default, the newer Quarkus-based versions of Keycloak (17+) use DNS_PING as the discovery mechanism for JGroups (the underlying clustering library), but you still need to activate it.
You'll need:
- a headless service (clusterIP: None) selecting the Keycloak pods,
- KC_CACHE_STACK=kubernetes (to activate the kubernetes JGroups stack), and
- JAVA_OPTS_APPEND=-Djgroups.dns.query=<name-of-headless-service> (to tell it how to find the other Keycloak pods).
That way, when starting up, JGroups issues a DNS query (for example, keycloak-headless.my_namespace.svc.cluster.local) and the response is the IP of every pod associated with the headless service. JGroups then contacts each IP on the communication port and establishes the cluster.
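A rough, untested sketch of those pieces; the headless service name keycloak-headless, the image tag, and the JGroups port are assumptions, and the database/hostname settings a Quarkus Keycloak also needs are omitted:
# Headless service: clusterIP: None makes the DNS query return the pod IPs
apiVersion: v1
kind: Service
metadata:
  name: keycloak-headless
spec:
  clusterIP: None
  selector:
    app: keycloak
  ports:
    - name: jgroups
      port: 7800
      targetPort: 7800
---
# Relevant fragment of the Keycloak (17+) Deployment pod template
spec:
  template:
    spec:
      containers:
        - name: keycloak
          image: quay.io/keycloak/keycloak:19.0.3   # assumed image/tag
          env:
            - name: KC_CACHE_STACK
              value: kubernetes
            - name: JAVA_OPTS_APPEND
              value: "-Djgroups.dns.query=keycloak-headless.default.svc.cluster.local"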
UPDATE 2022-08-01: The configuration below is for the legacy (WildFly-based) version of Keycloak, i.e. versions up to 16. From 17 on, Keycloak moved to the Quarkus distribution and the configuration is different, as described above.
The way KUBE_PING works is similar to running kubectl get pods inside one Keycloak pod to find the other Keycloak pods' IPs, and then trying to connect to them one by one. The difference is that Keycloak does this by querying the Kubernetes API directly instead of using kubectl.
To access the Kubernetes API, Keycloak needs credentials in the form of an access token. You can pass your token directly, but this is not very secure or convenient.
Kubernetes has a built-in mechanism for injecting a token into a pod (or the software running inside that pod) to allow it to query the API. This is done by creating a service account, giving it the necessary permissions through a RoleBinding, and setting that account in the pod configuration.
The token is then mounted as a file at a known location, which is hardcoded and expected by all Kubernetes clients. When the client wants to call the API, it looks for the token at that location.
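For illustration, that well-known location is the same in every pod; a quick way to confirm the token is there (the pod name is just an example taken from the logs above):
kubectl exec keycloak-567575d6f8-c5s42 -- \
  cat /var/run/secrets/kubernetes.io/serviceaccount/token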
You can get a deeper look at the Service Account mechanism in the documentation.
In some situations, you may not have the necessary permissions to create RoleBindings. In this case, you can ask an administrator to create the service account and RoleBinding for you or pass your own user's token (if you have the necessary permissions) through the SA_TOKEN_FILE environment variable.
You can create the file using a Secret or ConfigMap, mount it into the pod, and set SA_TOKEN_FILE to the file location. Note that this mechanism is specific to the JGroups library (used by Keycloak); see its KUBE_PING documentation for the details.
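A rough sketch of that variant; the Secret name, the mount path, and the token value are placeholders you would replace with your own:
# Hypothetical Secret holding a token that is allowed to list pods
apiVersion: v1
kind: Secret
metadata:
  name: keycloak-api-token            # assumed name
stringData:
  token: "<paste-a-valid-token-here>" # deliberately left out
---
# Relevant fragment of the Keycloak Deployment pod template
spec:
  template:
    spec:
      containers:
        - name: keycloak
          env:
            - name: SA_TOKEN_FILE
              value: /etc/keycloak-token/token   # assumed mount path
          volumeMounts:
            - name: api-token
              mountPath: /etc/keycloak-token
              readOnly: true
      volumes:
        - name: api-token
          secret:
            secretName: keycloak-api-token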
If you do have permissions to create service accounts and RoleBindings in the cluster:
An example (not tested):
export TARGET_NAMESPACE=default
# convenient method to create a service account
kubectl create serviceaccount keycloak-kubeping-service-account -n $TARGET_NAMESPACE
# No convenient method to create Role and RoleBindings
# Needed to explicitly define them.
cat <<EOF | kubectl apply -f -
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: keycloak-kubeping-pod-reader
  namespace: $TARGET_NAMESPACE
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: keycloak-kubeping-api-access
  namespace: $TARGET_NAMESPACE
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: keycloak-kubeping-pod-reader
subjects:
  - kind: ServiceAccount
    name: keycloak-kubeping-service-account
    namespace: $TARGET_NAMESPACE
EOF
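A quick sanity check that the binding gives KUBE_PING the access it needs (assuming the names used above); it should print yes:
kubectl auth can-i list pods -n $TARGET_NAMESPACE \
  --as=system:serviceaccount:$TARGET_NAMESPACE:keycloak-kubeping-service-account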
On the deployment, you set the serviceAccount:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: keycloak
spec:
  template:
    spec:
      serviceAccount: keycloak-kubeping-service-account
      serviceAccountName: keycloak-kubeping-service-account
      containers:
        - name: keycloak
          image: jboss/keycloak
          env:
            # ...
            - name: JGROUPS_DISCOVERY_PROTOCOL
              value: kubernetes.KUBE_PING
            - name: JGROUPS_DISCOVERY_PROPERTIES
              value: dump_requests=true
            - name: KUBERNETES_NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
            # ...
dump_requests=true will help you debug the Kubernetes API requests; it's better to set it to false in production. You can use namespace=<your-namespace> in JGROUPS_DISCOVERY_PROPERTIES instead of the KUBERNETES_NAMESPACE environment variable, but the fieldRef above is a handy way for the pod to autodetect the namespace it's running in.
Please note that KUBE_PING will find all pods in the namespace, not only keycloak pods, and will try to connect to all of them. Of course, if your other pods don't care about that, it's OK.
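If probing unrelated pods is a concern, the jgroups-kubernetes discovery can be narrowed with a label selector; a small sketch, assuming your Keycloak pods carry the app=keycloak label:
          env:
            # only pods matching this label are contacted (assumed label)
            - name: KUBERNETES_LABELS
              value: app=keycloak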
Upvotes: 11
Reputation: 164
After a long time with it, the best option is JDBC_PING, which also fits a Kubernetes environment. This approach works with Keycloak and with a separate Infinispan cluster as well.
A basic approach can be found here: https://github.com/thomasdarimont/keycloak-project-example/blob/main/deployments/local/cluster/haproxy-database-ispn/cli/0300-onstart-setup-ispn-jdbc-store.cli
What I suggest is generating a CLI script that runs on startup, or configuring the discovery protocol through environment variables (a sketch follows below). You'll need a database to persist the data, and the nodes register themselves there. It works in all environments.
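For the legacy jboss/keycloak image used in the question, a minimal sketch of the environment-variable route (untested; it assumes the image's default KeycloakDS datasource, so JDBC_PING reuses the database Keycloak already connects to):
          env:
            # added next to the existing DB_* variables in the container env
            - name: JGROUPS_DISCOVERY_PROTOCOL
              value: JDBC_PING
            - name: JGROUPS_DISCOVERY_PROPERTIES
              # assumed default datasource JNDI name of the legacy image
              value: datasource_jndi_name=java:jboss/datasources/KeycloakDS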
Feel free to use the repo I put together, which contains the whole setup for a clustered environment backed by MySQL: https://github.com/albertoSoto/keycloak-infinispan-cluster
Upvotes: 1