MaatDeamon

Reputation: 9771

Join tables with Kafka Streams / KSQL?

I'm importing a DB which contains some link tables representing both many-to-many and one-to-many relationships.

Let's focus for now on the one-to-many relationship. E.g. a BioAssay can have many documents, but a document can only have one BioAssay.

Hence I have a BioAssay table [BioAssay, ..., ..., ...] and a link table [Document, BioAssay].

Ultimately I need to join those two into a full BioAssay with all its documents, e.g. [BioAssayxyz, ...., "Document1:Document2:Document3"].

I would like to learn what needs to happen with Kafka Streams.

1 - So far, based on my understanding of Kafka Streams, it seems that I need a stream for each link table in order to perform the aggregation. A KTable would not be usable here because records are updated per key. The result of the aggregation could be in a KTable, however (a sketch of what I have in mind follows after my questions below).

2 - Then comes the problem of joining on foreign keys. It seems the only way to do that is through a GlobalKTable: link-table-topic -> link-table-stream -> link-table-GlobalKTable. This could result in a lot of disk space usage, as my tables are very large. This is a very large DB with a lot of tables, and the requirement of building several logical views on the data is at the core of the project and can't be avoided.

a) Am I understanding this correctly?

b) Is this the only way to tackle it?
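
For point 1, this is roughly what I have in mind (just a sketch; the topic name and the record layout, key = document id / value = BioAssay id, are assumptions, and deletes are not handled):

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KeyValue;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.*;

    public class LinkTableAggregation {

        static StreamsBuilder buildAggregation() {
            StreamsBuilder builder = new StreamsBuilder();

            // Link-table records: key = document id, value = BioAssay id.
            KStream<String, String> links = builder.stream(
                    "document-bioassay-link",
                    Consumed.with(Serdes.String(), Serdes.String()));

            // Re-key by BioAssay id (this triggers a repartition), then fold
            // all document ids per BioAssay into one "Doc1:Doc2:..." string.
            // The aggregation result lives in a KTable, as suggested above.
            KTable<String, String> docsPerBioAssay = links
                    .map((docId, bioAssayId) -> KeyValue.pair(bioAssayId, docId))
                    .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                    .aggregate(
                            () -> "",
                            (bioAssayId, docId, agg) ->
                                    agg.isEmpty() ? docId : agg + ":" + docId,
                            Materialized.with(Serdes.String(), Serdes.String()));

            return builder;
        }
    }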

Edit

It sounds like the only join that exists is KStream-to-GlobalKTable, so it seems I need to turn things upside down a little: my original DB BioAssay table needs to be turned into a stream, while my document link table needs to be turned into a stream first for aggregation, and then into a GlobalKTable for joining (sketched below).

Either way, unless my streams only have one partition, this can be very expensive.
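
Continuing the aggregation sketch above, I picture the join part roughly like this (again just a sketch; the intermediate topic name docs-per-bioassay is made up, and the aggregation and the join could also live in separate applications):

    // Write the aggregated link table to an intermediate topic...
    docsPerBioAssay.toStream().to("docs-per-bioassay",
            Produced.with(Serdes.String(), Serdes.String()));

    // ...and read it back as a GlobalKTable, fully replicated on each instance.
    GlobalKTable<String, String> docsGlobal = builder.globalTable(
            "docs-per-bioassay",
            Consumed.with(Serdes.String(), Serdes.String()));

    KStream<String, String> bioAssays = builder.stream(
            "bioassay", Consumed.with(Serdes.String(), Serdes.String()));

    // The key extractor maps each BioAssay record to its own id in the
    // global table; the joiner appends the concatenated document list.
    KStream<String, String> enriched = bioAssays.join(
            docsGlobal,
            (bioAssayId, row) -> bioAssayId,
            (row, docs) -> row + ", \"" + docs + "\"");

    enriched.to("bioassay-with-documents",
            Produced.with(Serdes.String(), Serdes.String()));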

Upvotes: 0

Views: 1202

Answers (2)

aishwarya kumar

Reputation: 69

If your tables share the same key (i.e. the foreign key), then you can use this to your advantage and stream all your tables into the same topic (you can use multiple partitions to scale out).
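
For example, re-keying the link stream by the shared BioAssay id before writing to a common topic keeps related records in the same partition (a sketch with made-up topic names, to be placed in a topology-building method):

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KeyValue;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.*;

    StreamsBuilder builder = new StreamsBuilder();

    // BioAssay records are assumed to be keyed by BioAssay id already.
    builder.stream("bioassay", Consumed.with(Serdes.String(), Serdes.String()))
           .to("merged-by-bioassay", Produced.with(Serdes.String(), Serdes.String()));

    // Link records (key = document id, value = BioAssay id) are re-keyed
    // first, so both tables' records for one BioAssay share a partition.
    builder.stream("document-bioassay-link",
                   Consumed.with(Serdes.String(), Serdes.String()))
           .map((docId, bioAssayId) -> KeyValue.pair(bioAssayId, docId))
           .to("merged-by-bioassay", Produced.with(Serdes.String(), Serdes.String()));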

Upvotes: 0

Svend

Reputation: 7218

I happened to have worked on a similar use case with Kafka Streams a few months ago, and I'm happy to share my learnings.

Using KStream-to-KTable as you suggest would kinda work, although with some caveats that might not be acceptable to you.

First, recall that a stream-to-table join is only updated by Kafka Streams when a new event is received on the stream side, not on the KTable side.

Second, assuming you're using CDC to import the DB, then my understanding is that you have no guarantee on the order in which updates land on Kafka. That means that even if you enjoy transaction isolation on the DB side, making an update or insert on the Document and BioAssay tables appear "all at once", on the Kafka side you'd receive one and then the other, in arbitrary order.

The two points above hopefully make clear why the join result on the Kafka Streams side might not reflect the DB content as you'd expect.

The solution I took was to go "under the hood" and join my streams manually using the Processor API. This allowed me to achieve table-to-table join semantics, updated whenever either side is updated. I described the core idea in this blog post:

https://svend.kelesia.com/one-to-many-kafka-streams-ktable-join.html

Using that technique, I was able to correctly import both one-to-many and many-to-many relationships from the DB.
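
In outline, the idea looks something like this (a minimal sketch of the general approach, not the exact code from the post: topic and store names are made up, both input topics are assumed co-partitioned by BioAssay id with the link stream already re-keyed accordingly, and deletes/tombstones are ignored):

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.processor.api.Processor;
    import org.apache.kafka.streams.processor.api.ProcessorContext;
    import org.apache.kafka.streams.processor.api.Record;
    import org.apache.kafka.streams.processor.api.RecordMetadata;
    import org.apache.kafka.streams.state.KeyValueIterator;
    import org.apache.kafka.streams.state.KeyValueStore;
    import org.apache.kafka.streams.state.Stores;

    public class OneToManyJoinSketch {

        // Keeps both sides in local state and re-emits the aggregated join
        // result whenever either side receives an update.
        static class Joiner implements Processor<String, String, String, String> {
            private ProcessorContext<String, String> context;
            private KeyValueStore<String, String> bioAssays; // bioAssayId -> row
            private KeyValueStore<String, String> docs;      // "bioAssayId|docId" -> docId

            @Override
            public void init(ProcessorContext<String, String> context) {
                this.context = context;
                this.bioAssays = context.getStateStore("bioassay-store");
                this.docs = context.getStateStore("docs-store");
            }

            @Override
            public void process(Record<String, String> rec) {
                String topic = context.recordMetadata()
                        .map(RecordMetadata::topic).orElse("");
                String bioAssayId = rec.key();
                if (topic.equals("document-links-by-bioassay")) {
                    // Link side, re-keyed: key = bioAssay id, value = doc id.
                    docs.put(bioAssayId + "|" + rec.value(), rec.value());
                } else {
                    // BioAssay side: key = bioAssay id, value = the row itself.
                    bioAssays.put(bioAssayId, rec.value());
                }
                String row = bioAssays.get(bioAssayId);
                if (row == null) return; // BioAssay side not seen yet.

                // Range scan over the composite keys collects every document
                // currently known for this BioAssay.
                StringBuilder allDocs = new StringBuilder();
                try (KeyValueIterator<String, String> it =
                         docs.range(bioAssayId + "|", bioAssayId + "|\uffff")) {
                    while (it.hasNext()) {
                        if (allDocs.length() > 0) allDocs.append(':');
                        allDocs.append(it.next().value);
                    }
                }
                context.forward(new Record<>(bioAssayId,
                        row + ", \"" + allDocs + "\"", rec.timestamp()));
            }
        }

        static Topology build() {
            Topology t = new Topology();
            t.addSource("Sources",
                    Serdes.String().deserializer(), Serdes.String().deserializer(),
                    "bioassay", "document-links-by-bioassay");
            t.addProcessor("Joiner", Joiner::new, "Sources");
            t.addStateStore(Stores.keyValueStoreBuilder(
                    Stores.persistentKeyValueStore("bioassay-store"),
                    Serdes.String(), Serdes.String()), "Joiner");
            t.addStateStore(Stores.keyValueStoreBuilder(
                    Stores.persistentKeyValueStore("docs-store"),
                    Serdes.String(), Serdes.String()), "Joiner");
            t.addSink("Out", "bioassay-with-documents",
                    Serdes.String().serializer(), Serdes.String().serializer(),
                    "Joiner");
            return t;
        }
    }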

Upvotes: 3
