Reputation: 201
We are using a Kafka stack to perform extraction, transformation, and loading (ETL) from a set of source databases into a target database. We use Debezium for change data capture (CDC), Kafka Streams for data transformation, and the Kafka Connect JDBC Sink for writing data to the target database. In our scenario, multiple source databases are monitored by Debezium.
Initially, the JDBC Sink consumed a single topic (with several partitions) per table. This caused issues, because the data from each source database must be inserted into the destination database independently.
To address this, we created one topic per source database on the JDBC Sink side and customized the JDBC Sink to subscribe to topics via a regular expression (regex). We also implemented logic to pause consumption from any topic that contains inconsistent records, and this pausing approach has been working well for us.
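For context, the regex-subscribing sink looks roughly like the sketch below (the connector name, regex, and connection details are placeholders, and the custom pause-on-inconsistent-records logic is not shown):

```json
{
  "name": "jdbc-sink-all-sources",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "1",
    "topics.regex": "target_db_.*",
    "connection.url": "jdbc:postgresql://target-host:5432/warehouse",
    "connection.user": "etl_user",
    "connection.password": "********",
    "insert.mode": "upsert",
    "pk.mode": "record_key",
    "auto.create": "true"
  }
}
```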
After deploying the new solution in production, we tried increasing the number of tasks in the JDBC Sink from one to ten. However, we ran into an issue: because the new model divides the data across topics rather than partitions, task 0 subscribes to all the topics while the remaining tasks sit idle.
Is there a way to make Kafka Connect (i.e. the underlying Kafka consumer) rebalance topics across tasks automatically? We have not tried subscribing to explicit topic names instead of a regex. Would Kafka handle that approach better?
Upvotes: 0
Views: 189
Reputation: 191874
There is no rebalance API. Instead, you can deploy 10 connectors, each with one task, and set consumer.override.group.id to the same value for each sink. Otherwise, you are limited to the autogenerated, separate connect-${name} groups.
The connectors will then operate as a standard consumer group. If one connector stops (which you can also monitor per connector via the /status endpoint), the group will rebalance.
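A minimal sketch of one such sink, assuming the Confluent JDBC sink and a worker configured with connector.client.config.override.policy=All so that the consumer override is accepted (all names, the regex, and connection details are placeholders):

```json
{
  "name": "jdbc-sink-01",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "1",
    "topics.regex": "target_db_.*",
    "consumer.override.group.id": "jdbc-sink-shared",
    "connection.url": "jdbc:postgresql://target-host:5432/warehouse",
    "insert.mode": "upsert",
    "pk.mode": "record_key"
  }
}
```

Deploy ten copies of this, changing only the name (e.g. jdbc-sink-01 through jdbc-sink-10). Because they all share one group.id and the same subscription, the regular consumer group protocol spreads the topic partitions across them, and if any one of them stops (observable via GET /connectors/&lt;name&gt;/status) the group rebalances onto the rest.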
Kafka Sink Connector with custom consumer-group name
Upvotes: 0