Reputation: 480
I have a job running on Flink 1.14.3 (Java 11) that uses RocksDB as the state backend. The problem is that the job requires an amount of memory pretty much equal to the overall state size.
To keep it stable (and capable of taking snapshots), these are the settings I have in place:
state.backend: rocksdb
state.backend.incremental: 'true'
state.backend.rocksdb.localdir: /opt/flink/rocksdb <-- SSD volume (see below)
state.backend.rocksdb.memory.managed: 'true'
state.backend.rocksdb.predefined-options: FLASH_SSD_OPTIMIZED
taskmanager.memory.managed.fraction: '0.9'
taskmanager.memory.framework.off-heap.size: 512mb
taskmanager.numberOfTaskSlots: '4' (parallelism: 16)
Also this:
- name: rocksdb-volume
  volume:
    emptyDir:
      sizeLimit: 100Gi
    name: rocksdb-volume
  volumeMount:
    mountPath: /opt/flink/rocksdb
This provides plenty of disk space for each task manager. With those settings the job runs smoothly and, in particular, there is a relatively large memory margin. The problem is that memory consumption slowly increases, and that with a smaller margin the snapshots fail. I have tried reducing the number of task managers, but I need 4. The same goes for the amount of RAM: I have tried giving e.g. 16 GB instead of 30 GB per TM, but the problem is the same. Another setup that has worked for us is 8 TMs with 16 GB of RAM each, but again that adds up to the same total memory as the current settings. And even with that amount of memory, I can see that consumption keeps growing and will probably lead to a bad end...
Also, the latest snapshot took around 120 GB, so as you can see I am using an amount of RAM similar to the size of the total state, which defeats the whole purpose of using a disk-based state backend (RocksDB) plus local SSDs.
Is there an effective way of limiting the memory that RocksDB takes (to what is available on the running pods)? Nothing I have found or tried so far has worked. In theory, the images I am using ship with jemalloc for memory allocation, which should avoid the memory fragmentation issues observed with malloc in the past.
UPDATE 1: Attached is the memory evolution with `taskmanager.memory.managed.fraction` set to 0.25 and to 0.1. Apparently, the job still ends up using all the available memory in the long run.
UPDATE 2: Following David's suggestion, I have tried lowering `taskmanager.memory.managed.fraction` as well as the total amount of memory. It turns out the job runs smoothly with 8 GB per TM if the managed fraction is set to 0.2; with the fraction at 0.9, however, the job fails to start (due to lack of memory) unless 30 GB per TM are given. The following screenshot shows the memory evolution with the managed fraction set to 0.2 and the TM memory set to 8 GB. At around 13:50 a significant amount of memory was freed as a result of taking a snapshot (which worked well). Overall, the job looks pretty stable now...
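For reference, this is roughly what that stable setup corresponds to in flink-conf.yaml (the process-size key is just my way of expressing the 8 GB figure, not copied verbatim from the job). Assuming Flink's default JVM overhead (capped at 1 GB) and 256 MB of metaspace, an 8 GB process leaves about 7 GB of Total Flink Memory, so a 0.2 fraction gives RocksDB roughly 1.4 GB of managed memory and still leaves a few GB of heap; a 0.9 fraction would leave no room for the heap and network buffers at all, which is consistent with the failure to start:
taskmanager.memory.process.size: 8g
taskmanager.memory.managed.fraction: '0.2'
state.backend.rocksdb.memory.managed: 'true'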
Upvotes: 2
Views: 3635
Reputation: 52
Just came across this post -- thought I'd share an article from the Shopify Engineering blog (no, I don't work there).
The summary is: consider disabling RocksDB block cache via custom RocksDBOptionsFactory.
https://shopify.engineering/optimizing-apache-flink-applications-tips
You can take a look at section 7, "Understand RocksDB Memory Usage", which I shall attempt to copy-paste below.
We also observed another memory-related issue that was very complex to debug that happened whenever we:
- started one of the Flink applications with a lot of state
- waited for at least an hour
- manually terminated one of the Task Manager containers.
We expected a replacement Task Manager to be added to the cluster (thanks to the Kubernetes Deployment) and the application recovery to happen shortly after that. But instead, we noticed another one of our Task Managers crashing with an “Out of Memory” error, leading to a never-ending loop of crashes and restarts.
We confirmed that the issue only happens for an application with a lot of state that’s been running for at least an hour. We were surprised to realize that the “Out of Memory” errors didn’t come from the JVM—heap profiling using async-profiler and VisualVM didn’t show any problems. But still, something led to increased memory usage and eventually forced Kubernetes runtime to kill a pod violating its memory limits.
So we started to suspect that the RocksDB state backend was using a lot of native memory outside of JVM. We configured jemalloc to periodically write heap dumps to a local filesystem so we could analyze them with jeprof. We were able to capture a heap dump just before another “Out of Memory” error and confirmed RocksDB trying to allocate more memory than it was configured to use:
Total: 6743323999 B
5110634379 75.8% 75.8% 5110634379 75.8% rocksdb::UncompressBlockContentsForCompressionType
1438188896 21.3% 97.1% 1438188896 21.3% os::malloc@bf9490
134350858 2.0% 99.1% 134350858 2.0% rocksdb::Arena::AllocateNewBlock
22944600 0.3% 99.4% 27545943 0.4% rocksdb::LRUCacheShard::Insert
22551264 0.3% 99.8% 5133185644 76.1% rocksdb::BlockBasedTable::PartitionedIndexIteratorState::NewSecondaryIterator
4732478 0.1% 99.9% 4732478 0.1% rocksdb::LRUHandleTable::Resize
2845444 0.0% 99.9% 2845444 0.0% std::string::_Rep::_S_create
1333241 0.0% 99.9% 1333241 0.0% inflate
1310779 0.0% 99.9% 1310779 0.0% rocksdb::WritableFileWriter::Append
1191205 0.0% 100.0% 2633942 0.0% _init
In this particular example, Flink Managed Memory was configured to use 5.90 GB, but the profile clearly shows 6.74 GB being used.
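(A quick aside from the quote: if you want to produce similar dumps yourself, jemalloc's built-in profiler can be asked to write a heap profile periodically, and jeprof can then summarize it. The interval, prefix and java path below are only illustrative, and the jemalloc build has to have profiling enabled:)
MALLOC_CONF="prof:true,lg_prof_interval:30,prof_prefix:/tmp/jem" <-- dump a profile roughly every 2^30 bytes (~1 GiB) allocated
jeprof --show_bytes --text /path/to/java /tmp/jem.*.heap <-- summarize which native call sites hold the memory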
We found a relevant RocksDB issue that supports this claim: many users of the library reported various memory-related issues over the last three years. We followed one of the suggestions in the issue and tried to disable RocksDB block cache using a custom RocksDBOptionsFactory:
import org.apache.flink.contrib.streaming.state.{ConfigurableRocksDBOptionsFactory, RocksDBOptionsFactory}
import org.rocksdb.DBOptions
import org.rocksdb.ColumnFamilyOptions
import org.rocksdb.BlockBasedTableConfig
import org.apache.flink.configuration.ReadableConfig
import java.util

class NoBlockCacheRocksDbOptionsFactory extends ConfigurableRocksDBOptionsFactory {
  override def createDBOptions(currentOptions: DBOptions, handlesToClose: util.Collection[AutoCloseable]): DBOptions = {
    currentOptions
  }

  override def createColumnOptions(
      currentOptions: ColumnFamilyOptions,
      handlesToClose: util.Collection[AutoCloseable]
  ): ColumnFamilyOptions = {
    val blockBasedTableConfig = new BlockBasedTableConfig()
    blockBasedTableConfig.setNoBlockCache(true)
    // Needed in order to disable block cache
    blockBasedTableConfig.setCacheIndexAndFilterBlocks(false)
    blockBasedTableConfig.setCacheIndexAndFilterBlocksWithHighPriority(false)
    blockBasedTableConfig.setPinL0FilterAndIndexBlocksInCache(false)
    currentOptions.setTableFormatConfig(blockBasedTableConfig)
    currentOptions
  }

  override def configure(configuration: ReadableConfig): RocksDBOptionsFactory = {
    this
  }
}
It worked! Now, even after killing a Task Manager we didn’t observe any memory issues.
Disabling RocksDB block cache didn’t affect performance. In fact, we only saw a difference during the time it takes to populate the cache. But there was no difference in performance between a Flink application with disabled RocksDB block cache and a Flink application with full RocksDB block cache. This also explains why we needed to wait in order to reproduce the issue: we were waiting for the block cache to fill. We later confirmed it by enabling the RocksDB Native Metrics.
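If you want to try the same change, the custom factory is normally wired in via flink-conf.yaml, and the RocksDB native metrics mentioned at the end of the quote can be switched on there as well. The fully-qualified class name below is just a placeholder for wherever you package the factory:
state.backend.rocksdb.options-factory: com.yourorg.NoBlockCacheRocksDbOptionsFactory
state.backend.rocksdb.metrics.block-cache-usage: 'true'
state.backend.rocksdb.metrics.block-cache-capacity: 'true'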
Upvotes: 2
Reputation: 43697
RocksDB is designed to use all of the memory you give it access to -- so if it can fit all of your state in memory, it will. And given that you've increased taskmanager.memory.managed.fraction from 0.4 to 0.9, it's not surprising that your overall memory usage approaches its limit over time.
If you give RocksDB rather less memory, it should cope. Have you tried that?
Upvotes: 2