alexanoid

Reputation: 25852

Debezium initial data snapshot and related entities order

After the first launch, Debezium will take an initial snapshot of the already existing data.

Let's say I have two tables - A and B. Table B has a NOT NULL FK constraint on A. By default, Debezium will create two separate Kafka topics for the data from tables A and B.

In my understanding, there is a very big chance that I'll try to insert a record into the new table B while the corresponding record is not yet present in the new table A. This way I'll run into a constraint violation error.

Do I need to use some third-party buffer and organize the proper insert order into the sink database myself, or is there some standard mechanism in Debezium to handle such situations?

For example - can I use Debezium Topic Routing (https://debezium.io/documentation/reference/configuration/topic-routing.html) to fix this issue? I could configure Topic Routing to send all dependent events (from tables A and B in my example above) to the same topic. With a Kafka topic that has a single partition, all events would be totally ordered. Will this work, and will it give me the correct related-entities order for the initial snapshot data load?
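For illustration, here is roughly the configuration I have in mind. This is only a sketch based on the topic-routing docs; the connector class, server name (dbserver1), schema, and the merged topic name are made up, and property names may differ between Debezium versions:

```properties
# Hypothetical Debezium connector config (Kafka Connect .properties form).
name=inventory-connector
connector.class=io.debezium.connector.postgresql.PostgresConnector
database.server.name=dbserver1
table.include.list=public.a,public.b

# Reroute the per-table topics dbserver1.public.a and dbserver1.public.b
# into one shared topic so that events for both tables form a single stream.
transforms=Reroute
transforms.Reroute.type=io.debezium.transforms.ByLogicalTableRouter
transforms.Reroute.topic.regex=dbserver1\\.public\\.(a|b)
transforms.Reroute.topic.replacement=dbserver1.public.a_and_b

# Records from different tables could collide on the same key in the shared
# topic, so add a discriminator field to the key to keep keys unique.
transforms.Reroute.key.enforce.uniqueness=true
```

The idea would be to create dbserver1.public.a_and_b with a single partition, so a consumer sees one total order of events for both tables.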

Upvotes: 0

Views: 1066

Answers (1)

Shawn

Reputation: 288

The IBM IDR (Data Replication) product solves this with a mechanism that provides exactly-once semantics and re-creates both the ordering of operations within a transaction and the ordering of transactions.

Kafka's built-in exactly-once features have some limitations beyond performance: you don't inherently get transactions re-ordered by operation, which is important for things like applying changes under referential integrity constraints.

So in our product we have a proper way and a poor man's way to solve the problem. The poor man's way is to send the data for all the tables to a single topic. Obviously this is sub-optimal, but our product will produce data in operation order from a single producer if you do this. You'd probably also want an idempotent producer to avoid batches showing up out of order.
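To sketch what the idempotence point means in plain Kafka terms (these are standard Kafka producer settings, not something specific to our product): assuming the source is a Kafka Connect connector such as Debezium, and the worker's connector.client.config.override.policy permits client overrides, the per-connector overrides might look roughly like this:

```properties
# Hypothetical per-connector producer overrides (Kafka Connect, KIP-458).
# enable.idempotence makes the broker de-duplicate retried batches, so a
# retry cannot reorder writes within a partition.
producer.override.enable.idempotence=true
# Idempotence requires acks=all and a bounded number of in-flight requests.
producer.override.acks=all
producer.override.max.in.flight.requests.per.connection=5
```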

Now the pro-level way to solve this is a feature called the TCC (Transactionally Consistent Consumer).

I'm not sure whether you need an enterprise-level solution, performance- and feature-wise.

If this is a non-critical project, you might find the following talk useful for understanding how we approach delivering the features you're looking for:

https://www.confluent.io/kafka-summit-sf18/a-solution-for-leveraging-kafka-to-provide-end-to-end-acid-transactions/

And here are our docs on the feature, for reference:

https://www.ibm.com/support/knowledgecenter/en/SSTRGZ_11.4.0/com.ibm.cdcdoc.cdckafka.doc/concepts/kafkatcc.html

Hopefully that gives some background on why this problem is hard to solve and what goes into a solution.

Upvotes: 1
