ha9u63a7

Reputation: 6854

Apache Kafka Connect/Streams API for synchronising database tables

I was reading through the Kafka documentation on the Connect API and trying to relate it to my problem domain. I have multiple databases which have common tables that I need to synchronise on any updates/inserts/deletes. An example is this:

1) Someone updates table "order_history" in DB1 - I want the update to be communicated to DB2/DB3 etc.

2) Someone inserts a record into "purchase_order" - I want the insert to be communicated to DB2/DB3 etc. so that the same insert happens in DB2/DB3 etc.

3) The tables will be in all the DBs - so no missing table issue there.

These are only to be done on a specific set of tables, not the entire database. What I understand from Connect API documentation is that I need to provide the following:

1) Source Connector imports data - from SQL/File system to Kafka topics

2) Sink Connector exports data - from Kafka topics to SQL/File system/Hadoop FS

But then I am trying to understand how this is relevant to syncing multiple database tables on any inserts/updates/deletes - because the Connect API still involves writing to and reading from topics, which might not necessarily be what my use case is. I have also looked at Kafka Streams, but it seems to be a tool for data aggregation and counter management, again probably not my use case.

Could anyone explain whether my assumption is correct, and whether I should still explore the Streams/Connect APIs?

Regards,

Upvotes: 0

Views: 2220

Answers (1)

Robin Moffatt

Reputation: 32110

Yes, you can use Kafka Connect to apply changes from one database to another. You would typically use a CDC tool that takes events directly from the redo/transaction log on your source database and pushes each event to a Kafka topic. Examples include Oracle GoldenGate and the Debezium project.

Once on a Kafka topic, you can then use Kafka Connect's JDBC Sink to push these changes to a target database.
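
For illustration, here is a minimal sketch of what registering those two connectors might look like, using the Kafka Connect REST API. The hostnames, credentials, connector names and table/topic names are all placeholders, exact property names vary between Debezium/connector versions, and a real deployment needs more settings (schema history topic, converters, error handling), so treat it purely as a starting point:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/**
 * Sketch: register a CDC source (Debezium MySQL) and a JDBC sink with a
 * Kafka Connect worker via its REST API (assumed here at localhost:8083).
 */
public class RegisterConnectors {

    // Captures inserts/updates/deletes from DB1's binlog into topics
    // named <server>.<database>.<table>, e.g. db1.mydb.order_history
    private static final String SOURCE = """
        {
          "name": "db1-cdc-source",
          "config": {
            "connector.class": "io.debezium.connector.mysql.MySqlConnector",
            "database.hostname": "db1.example.com",
            "database.port": "3306",
            "database.user": "debezium",
            "database.password": "secret",
            "database.server.id": "184054",
            "database.server.name": "db1",
            "table.include.list": "mydb.order_history,mydb.purchase_order"
          }
        }
        """;

    // Applies those change events to DB2; the unwrap transform flattens
    // Debezium's change-event envelope into plain rows for the JDBC sink
    private static final String SINK = """
        {
          "name": "db2-jdbc-sink",
          "config": {
            "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
            "connection.url": "jdbc:postgresql://db2.example.com:5432/mydb",
            "connection.user": "kafka",
            "connection.password": "secret",
            "topics": "db1.mydb.order_history,db1.mydb.purchase_order",
            "transforms": "unwrap",
            "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
            "insert.mode": "upsert",
            "pk.mode": "record_key",
            "pk.fields": "id",
            "auto.create": "true"
          }
        }
        """;

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        for (String connector : new String[] {SOURCE, SINK}) {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:8083/connectors"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(connector))
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }
}
```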

Where this may not meet your requirement is if you also want to mirror deletes directly in your target: CDC records usually carry a column indicating the operation (e.g. "D" for delete), so instead of a delete you will get a row inserted on the target with this value.

However, if you are looking to literally mirror a set of tables from one DB to another DB, you should be looking at a database replication tool, not Kafka.

Where Kafka fits is if you want to stream events from one place to another (and want to store delete events rather than simply apply them as deletes on the target), with the option of using that same data to land in other targets or to drive applications directly. That could be Kafka Streams, a Kafka Consumer - or any of the multitude of other technologies and tools out there that integrate with Kafka.
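
For example, a bare-bones Kafka Consumer reading the change-event topic (broker address and topic name below are placeholders, following the Debezium-style <server>.<db>.<table> naming) and doing nothing more than printing each event - the point being that any downstream logic could go in its place:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

// Minimal consumer of the change-event topic: each record is one
// insert/update/delete captured from the source database.
public class OrderHistoryChangeConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-history-audit");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("db1.mydb.order_history"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // The value is the raw change event; react however you like
                    System.out.printf("key=%s value=%s%n", record.key(), record.value());
                }
            }
        }
    }
}
```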

In terms of Kafka Streams, this would be useful if you want to do some processing on the data you're pulling in from your source database - for example, joins, filtering or aggregation. As well as writing Java code directly with Kafka Streams, you now have the option of using a SQL-like interface on top of Kafka, with KSQL from Confluent.
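
To make that concrete, here is a minimal Kafka Streams sketch (topic names are placeholders, and it assumes the change events arrive as plain JSON strings) that filters the change stream down to delete events and writes them to a derived topic; joins and aggregations build on the same API:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

// Minimal Kafka Streams app: read the change-event topic, keep only the
// events we care about, and write them to a derived topic.
public class PurchaseOrderFilter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "purchase-order-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> changes = builder.stream("db1.mydb.purchase_order");

        // Keep only delete events (Debezium marks them with "op":"d") -
        // just an illustration of per-record processing on the stream
        changes.filter((key, value) -> value != null && value.contains("\"op\":\"d\""))
               .to("purchase_order_deletes");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

The KSQL equivalent of a filter like this would be a single CREATE STREAM ... AS SELECT statement, with no Java code to write at all.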

Upvotes: 2
