Reputation: 71
We are experiencing a very difficult-to-observe problem with our Flink job.
The Job is reasonably simple, it:
We are running Flink 1.10.1 Fargate, using 2 containers with 4vCPUs/8GB, we are using the RocksDB state backend with the following configuration:
state.backend: rocksdb
state.backend.async: true
state.backend.incremental: false
state.backend.rocksdb.localdir: /opt/flink/rocksdb
state.backend.rocksdb.ttl.compaction.filter.enabled: true
state.backend.rocksdb.files.open: 130048
The job runs with a parallelism of 8.
When the job starts from cold, it uses very little CPU and checkpoints complete in 2 sec. Over time, the checkpoint sizes increase but the times are still very reasonable couple of seconds:
During this time we can observe the CPU usage of our TaskManagers gently growing for some reason:
Eventually, the checkpoint time will start spiking to a few minutes, and then will just start repeatedly timing out (10 minutes). At this time:
SinkFunction
is not rich and has no state.Eventually this situation resolves one of 2 ways:
We are really at a loss at how to debug this. Our state seems very small compared to the kind of state you see in some questions on here. Our volumes are also pretty low, we are very often under 100 records/sec.
We'd really appreciate any input on areas we could look into to debug this.
Thanks,
Upvotes: 1
Views: 1253
Reputation: 940
Add this property to your configuration:
state.backend.rocksdb.checkpoint.transfer.thread.num: {threadNumberAccordingYourProjectSize}
if you do not add this , it will be 1 (default)
Upvotes: 0
Reputation: 43454
A few points:
It's not unusual for state to gradually grow over time. Perhaps your key space is growing, and you are keeping some state for each key. If you are relying on state TTL to expire stale state, perhaps it is not configured in a way that allows it clean up expired state as quickly as you would expect. It's also relatively easy to inadvertently create CEP patterns that need to keep some state for a very long time before certain possible matches can be ruled out.
A good next step would be to identify the cause of the backpressure. The most common cause is that a job doesn't have adequate resources. Most jobs gradually come to need more resources over time, as the number of users (for example) being managed rises. For example, you might need to increase the parallelism, or give the instances more memory, or increase the capacity of the sink(s) (or the speed of the network to the sink(s)), or give RocksDB faster disks.
Besides being inadequately provisioned, other causes of backpressure include
Enabling RocksDB native metrics might provide some insight.
Upvotes: 1