Yun Xing

Reputation: 75

How to auto scale up/down Flink Stateful Functions on K8s

My Current Flink Application

My Objectives

I want to automatically scale the stateful functions up and down. I also want to know how to create more standby job managers.

My Observations about the HA

I tried to set kubernetes.jobmanager.replicas in the flink-config ConfigMap:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: flink-config
  labels:
    app: shadow-fn
data:
  flink-conf.yaml: |+
    kubernetes.jobmanager.replicas: 7
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory

I see no standby job managers in K8s.
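For context, Flink's Kubernetes HA documentation pairs the factory class with a cluster ID and a storage directory. The fragment below is a minimal sketch of that setup, not a verified fix for this deployment: the cluster ID and the storage path are placeholder values. As far as I understand, `kubernetes.jobmanager.replicas` belongs to the native Kubernetes integration, where Flink creates the job manager pods itself, so it may simply have no effect when the Deployment is written by hand.

```yaml
# flink-conf.yaml fragment (sketch; values marked as placeholders are assumptions)
kubernetes.cluster-id: shadow-fn-cluster            # placeholder cluster ID
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: s3://my-bucket/flink/recovery   # placeholder durable storage path
```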

Then I adjusted the replicas of the Deployment directly:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: statefun-master
spec:
  replicas: 7

Standby job managers showed up. I checked the pod logs, and the leader election completed successfully. However, when I open the web UI in a browser, it says:

{"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}

What's wrong with my approach?

My Questions about the scaling

Reactive Mode is exactly what I need. I tried it, but it failed; the job manager reports this error:

Exception in thread "main" org.apache.flink.configuration.IllegalConfigurationException: Reactive mode is configured for an unsupported cluster type. At the moment, reactive mode is only supported by standalone application clusters (bin/standalone-job.sh).

It seems that auto scaling for Stateful Functions shouldn't be done this way. What is the correct way to do it, then?

Potential Approach (Probably Incorrect)

After some research, my current direction is:

  1. The Job Manager has nothing to do with auto scaling. It is related to HA on K8s. I just need to make sure the Job Manager has the correct failover behavior.
  2. My stateful functions are Flink remote services, i.e., they are regular K8s services. They can be deployed as KNative services to achieve auto scaling. The number of service replicas goes up only when HTTP requests arrive from Flink's workers.
  3. The most important part is Flink's worker (the Task Manager), and I have no idea yet how to auto scale it. Maybe I should use KNative to deploy the Flink worker? If that doesn't work with KNative, maybe I should change the Flink runtime deployment entirely, e.g., try the original Reactive Mode demo. But I'm afraid Stateful Functions are not intended to work like that.
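To make item 2 concrete, here is a minimal sketch that uses a plain Kubernetes HorizontalPodAutoscaler instead of KNative (a swapped-in technique, chosen only because it needs no extra components). It assumes the remote functions run in a Deployment named `shadow-fn-functions`, which is a hypothetical name, and the replica bounds and CPU target are illustrative values.

```yaml
# Sketch only: scales the remote function Deployment on CPU usage.
# "shadow-fn-functions" is a hypothetical Deployment name.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: shadow-fn-functions
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: shadow-fn-functions
  minReplicas: 1          # illustrative lower bound
  maxReplicas: 10         # illustrative upper bound
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # illustrative CPU target in percent
```

This only scales the function endpoints. It does not change the parallelism of the Flink workers, which is the open question in item 3.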

Finally

I have read the Flink documentation and the GitHub samples over and over but cannot find any more information on how to do this. Any hints, instructions, or guidelines are appreciated!

Upvotes: 0

Views: 788

Answers (1)

ChangLi

Reputation: 823

Since Reactive Mode is a new, experimental feature, not all features supported by the default scheduler are also available with Reactive Mode (and its adaptive scheduler). The Flink community is working on addressing these limitations.

https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/elastic_scaling/
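For reference, the linked page enables Reactive Mode through a single configuration key. The fragment below is a sketch of that setting as it would appear in `flink-conf.yaml`; the checkpoint interval is an illustrative value. Whether the StateFun image starts the master as a standalone application cluster, which the error message in the question requires, is an assumption that has to be checked against the image's entrypoint.

```yaml
# flink-conf.yaml fragment (sketch)
# Reactive Mode is only accepted by standalone application clusters,
# as the IllegalConfigurationException in the question states.
scheduler-mode: reactive
# Frequent checkpoints limit how much work is replayed after a rescale;
# 10s is an illustrative value.
execution.checkpointing.interval: 10s
```

With this in place, the job's parallelism follows the number of available task managers, so scaling then means changing the replica count of the worker Deployment.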

Upvotes: 0
