swagrov

Reputation: 1460

Postgres replication slot from kafka-connect is filling up

The replication slot created by a Kafka Connect connector keeps filling up.

I have a Postgres RDS database on AWS. I applied the following parameter group option to it (showing only the diff from the default):

rds.logical_replication: 1
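
To confirm the parameter actually took effect once the instance picked up the new parameter group, a quick sanity check from psql (just a check I would run, not something from the original setup):

SHOW wal_level;
-- expected: logical (rds.logical_replication = 1 sets wal_level to logical on RDS)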

I have Kafka Connect running with a Debezium Postgres connector. This is the config (with certain values redacted, of course):

"database.dbname"        = "mydb"
"database.hostname"      = "myhostname"
"database.password"      = "mypass"
"database.port"          = "myport"
"database.server.name"   = "postgres"
"database.user"          = "myuser"
"database.whitelist"     = "my_database"
"include.schema.changes" = "false"
"plugin.name"            = "wal2json_streaming"
"slot.name"              = "my_slotname"
"snapshot.mode"          = "never"
"table.whitelist"        = "public.mytable"
"tombstones.on.delete"   = "false"
"transforms"             = "key"
"transforms.key.field"   = "id"
"transforms.key.type"    = "org.apache.kafka.connect.transforms.ExtractField$Key"

If I get the status of this connector, it appears to be fine.

curl -s http://my.kafkaconnect.url:kc_port/connectors/my-connector/status | jq

{
  "name": "my-connector",
  "connector": {
    "state": "RUNNING",
    "worker_id": "some_ip"
  },
  "tasks": [
    {
      "id": 0,
      "state": "RUNNING",
      "worker_id": "some_ip"
    }
  ],
  "type": "source"
}

However, the replication slot in Postgres keeps getting larger and larger:

SELECT slot_name,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as replicationSlotLag,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) as confirmedLag,
  active
FROM pg_replication_slots;
           slot_name           | replicationslotlag | confirmedlag | active
-------------------------------+--------------------+--------------+--------
 my_slotname                   | 20 GB              | 20 GB        | t

Why does the replication slot keep growing? As I understand it, the Kafka Connect connector task that is running should be reading from this replication slot, publishing to the topic postgres.public.mytable, and then the replication slot should decrease in size. Am I missing something in this chain of actions?
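
For reference, a query I can run to check whether anything is connected to and consuming the slot (column names assume Postgres 10+):

SELECT application_name, state, sent_lsn, flush_lsn
FROM pg_stat_replication;
-- the Debezium connection should show up here with state 'streaming' while the task is running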

Upvotes: 4

Views: 5760

Answers (2)

Antriksh Vijay

Reputation: 21

Found this Google Groups discussion where Gunnar mentions this:

The core heartbeat feature regularly emits messages to the heartbeat topic, allowing to acknowledge processed WAL offsets also in case only events in filtered tables occur (this is what you observe). Heartbeat action queries (which require the table and inclusion in publication) are useful to address situations with multiple databases, where the connector is receiving changes from one database with otherwise no/low traffic, again allowing to acknowledge offsets in this case.

- Gunnar

In the group discussion he also mentions that the heartbeat table has to be added to the publication for the heartbeat action query to work. This should help.
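
A rough sketch of the database-side setup (the table name debezium_heartbeat and publication name dbz_publication are placeholders, not taken from the discussion; the publication step only matters if you use the pgoutput plugin, since wal2json does not work through publications):

CREATE TABLE IF NOT EXISTS public.debezium_heartbeat (
  id int PRIMARY KEY,
  ts timestamptz
);

-- Only needed with the pgoutput plugin; skip this for wal2json.
ALTER PUBLICATION dbz_publication ADD TABLE public.debezium_heartbeat;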

Upvotes: 0

Naros

Reputation: 21133

Please take a look at WAL Disk Space Consumption in the Debezium documentation.

The most common reason the PostgreSQL WAL backlogs is that the connector is monitoring a database, or a subset of tables in that database, that changes much less frequently than the other tables or databases in your environment, so the connector isn't acknowledging LSNs frequently enough to avoid the WAL backlog.

For Debezium 1.0.x and earlier, enable heartbeat.interval.ms.
For Debezium 1.1.0 and later, also consider enabling heartbeat.action.query.
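
A rough sketch of what those two properties could look like alongside the question's connector config (the interval value, heartbeat table name, and upsert query are placeholders; the table would be created as shown in the other answer):

"heartbeat.interval.ms"  = "10000"
"heartbeat.action.query" = "INSERT INTO public.debezium_heartbeat (id, ts) VALUES (1, now()) ON CONFLICT (id) DO UPDATE SET ts = now()"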

Upvotes: 4
