Desolar1um

Reputation: 115

Slow Knative service creation times

I have Knative v1.10.2 running on a GCP GKE cluster, installed using the Knative Operator with Kourier as the networking layer. The Knative Serving components are scaled to 3 pods each. I've been using kperf to test service creation speed, running tests that generate 10 services using the knative-sample/helloworld-go image. Averaged over multiple tests, a service takes ~45s to become ready.
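For reference, the tests are driven roughly like this (flag names per the kperf README; the exact values are from my runs, and kperf versions may vary):

# Generate 10 services (ksvc-0 .. ksvc-9) from knative-sample/helloworld-go
kperf service generate -n 10 -b 10 -c 5 -i 10 \
  --namespace ktest --svc-prefix ksvc --wait

# Measure readiness timings for the generated services (verbose output below)
kperf service measure --namespace ktest --svc-prefix ksvc \
  --range 0,9 --verbose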

What I'm struggling to grasp is why it takes so long, and why so much time is needed for the Configuration to become ready if the image is already on the node. For example:

[Verbose] Service ksvc-1: Service Configuration Ready Duration is 38s/38.000000s
[Verbose] Service ksvc-1: - Service Revision Ready Duration is 38s/38.000000s
[Verbose] Service ksvc-1:   - Service Deployment Created Duration is 11s/11.000000s
[Verbose] Service ksvc-1:     - Service Pod Scheduled Duration is 0s/0.000000s
[Verbose] Service ksvc-1:     - Service Pod Containers Ready Duration is 5s/5.000000s
[Verbose] Service ksvc-1:       - Service Pod queue-proxy Started Duration is 4s/4.000000s
[Verbose] Service ksvc-1:       - Service Pod user-container Started Duration is 3s/3.000000s
[Verbose] Service ksvc-1:   - Service PodAutoscaler Active Duration is 8s/8.000000s
[Verbose] Service ksvc-1:     - Service ServerlessService Ready Duration is 18s/18.000000s
[Verbose] Service ksvc-1:       - Service ServerlessService ActivatorEndpointsPopulated Duration is 10s/10.000000s
[Verbose] Service ksvc-1:       - Service ServerlessService EndpointsPopulated Duration is 18s/18.000000s
[Verbose] Service ksvc-1: Service Route Ready Duration is 48s/48.000000s
[Verbose] Service ksvc-1: - Service Ingress Ready Duration is 0s/0.000000s
[Verbose] Service ksvc-1:   - Service Ingress Network Configured Duration is 0s/0.000000s
[Verbose] Service ksvc-1:   - Service Ingress LoadBalancer Ready Duration is 0s/0.000000s
[Verbose] Service ksvc-1: Overall Service Ready Duration is 48s/48.000000s

Here we can see that the Service Configuration Ready Duration is around 80% of the overall ready duration. Initially I thought tag resolution was taking too long, as the output of kn revision list would show most time spent on "Resolving Digests", so I disabled tag resolution for gcr.io, but the result is roughly the same: the first 2-3 services move quickly to Deploying, while the rest stay stuck on "Resolving Digests" for an additional 20-30s (configs at the end).

kn revision list
NAME           SERVICE   TRAFFIC   TAGS   GENERATION   AGE         CONDITIONS   READY       REASON
ksvc-0-00001   ksvc-0                     1            <invalid>   0 OK / 0     <unknown>   <unknown>
ksvc-1-00001   ksvc-1                     1            <invalid>   0 OK / 3     Unknown     ResolvingDigests
ksvc-2-00001   ksvc-2                     1            <invalid>   0 OK / 0     <unknown>   <unknown>
ksvc-4-00001   ksvc-4                     1            <invalid>   0 OK / 3     Unknown     ResolvingDigests
ksvc-5-00001   ksvc-5                     1            <invalid>   0 OK / 3     Unknown     ResolvingDigests
ksvc-6-00001   ksvc-6                     1            <invalid>   0 OK / 3     Unknown     ResolvingDigests
ksvc-7-00001   ksvc-7                     1            <invalid>   0 OK / 3     Unknown     ResolvingDigests
ksvc-8-00001   ksvc-8                     1            <invalid>   0 OK / 0     <unknown>   <unknown>
ksvc-9-00001   ksvc-9                     1            <invalid>   0 OK / 4     Unknown     Deploying
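As a sanity check that the Operator actually propagated the skip list into Serving, the config-deployment ConfigMap can be inspected directly (plain kubectl; the key is the documented config-deployment one):

# Confirm the Operator propagated the tag-resolution skip list
kubectl -n knative-serving get configmap config-deployment \
  -o jsonpath='{.data.registries-skipping-tag-resolving}'
# should print: gcr.io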

I've read in an older issue that increasing the number of Controllers and Buckets could help; however, increasing my Controllers and Buckets to 10 has not helped.
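To check whether the extra buckets are actually spread across the controller replicas, the leader-election Leases can be inspected (these are standard coordination.k8s.io Leases; their names encode the reconciler and bucket index, and HOLDER shows the owning pod):

# One Lease per reconciler/bucket pair; HOLDER = controller replica owning it
kubectl -n knative-serving get leases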

I replicated my setup on our on-premises cluster, wondering whether something about our cloud infrastructure was slowing things down, but I saw the same results, with overall ready times averaging ~41s. Increasing Controllers and Buckets to 10 there actually slowed down the stack, increasing the average overall ready time to 52.9s.

How can I go about improving these numbers? And is the kperf tool still reliable? It's been in alpha for quite some time now.

Configs:

apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  high-availability:
    replicas: 3
  workloads:
  - name: controller
    replicas: 10
  services:
  - name: kourier
    annotations:
      networking.gke.io/load-balancer-type: "Internal"
  ingress:
    kourier:
      enabled: true
  config:
    deployment:
      registries-skipping-tag-resolving: "gcr.io"
      queue-sidecar-cpu-request: "100m"
      queue-sidecar-memory-request: "100Mi"
      queue-sidecar-cpu-limit: "250m"
      queue-sidecar-memory-limit: "250Mi"
    features:
      kubernetes.podspec-topologyspreadconstraints: "enabled"
      kubernetes.podspec-fieldref: "enabled"
    gc:
      retain-since-last-active-time: "1h"
      max-non-active-revisions: "1"
      min-non-active-revisions: "1"
    network:
      ingress-class: "kourier.ingress.networking.knative.dev"
    autoscaler:
      pod-autoscaler-class: "kpa.autoscaling.knative.dev"
      enable-scale-to-zero: "true"
    leader-election:
      buckets: "10"

The Operator's configs are the stock ones for v1.10.2, with the difference that I've set resource requests/limits: the webhook requests 100m CPU / 100Mi memory with limits of 500m / 500Mi, while the Operator itself requests 500m / 500Mi with limits of 2000m / 2000Mi.
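Something along these lines would set those overrides on the Operator's own deployments (assuming the default knative-operator namespace and deployment names from operator.yaml; other installs may differ):

# Resource overrides on the Operator deployments (default names assumed)
kubectl -n knative-operator set resources deployment/operator-webhook \
  --requests=cpu=100m,memory=100Mi --limits=cpu=500m,memory=500Mi
kubectl -n knative-operator set resources deployment/knative-operator \
  --requests=cpu=500m,memory=500Mi --limits=cpu=2000m,memory=2000Mi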

Please let me know if more information is required, and thank you for your time and patience.

Upvotes: 1

Views: 176

Answers (0)
