Reputation: 99921
I'm trying to produce a materialized view from a stream of database updates (fed from the DBMS's transaction log, e.g. via a CDC tool such as maxwell-daemon). The view is materialized as a compacted Kafka topic.
The view is a simple join and could be expressed as a query like this:
SELECT u.email user_email, t.title todo_title, t.state todo_state
FROM User u
JOIN Todo t
ON t.user_id = u.id
I want the view to be updated every time a User or Todo row changes (i.e., a message published to the view's Kafka topic).
With Kafka Streams it seems to be possible to achieve that by doing this:
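Something along these lines (a minimal sketch: the POJOs, serde wiring, and topic names are placeholders, and it assumes a Streams version with foreign-key KTable joins, i.e. 2.4+; older versions would require re-keying the Todo table by user_id first):

```java
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

// Placeholder POJOs mirroring the two tables and the joined view row.
class User { Long id; String email; }
class Todo { Long id; Long userId; String title; String state; }
class TodoView { String userEmail; String todoTitle; String todoState; }

public class TodoViewTopology {
    // userSerde/todoSerde/viewSerde are assumed to exist (e.g. JSON serdes).
    static void build(StreamsBuilder builder, Serde<User> userSerde,
                      Serde<Todo> todoSerde, Serde<TodoView> viewSerde) {
        // Each CDC changelog topic becomes a continuously updated table.
        KTable<Long, User> users =
            builder.table("users", Consumed.with(Serdes.Long(), userSerde));
        KTable<Long, Todo> todos =
            builder.table("todos", Consumed.with(Serdes.Long(), todoSerde));

        // Foreign-key KTable-KTable join: re-evaluated whenever either side changes.
        KTable<Long, TodoView> view = todos.join(
            users,
            (Todo t) -> t.userId,              // FK: Todo.user_id -> User.id
            (todo, user) -> {
                TodoView v = new TodoView();
                v.userEmail = user.email;
                v.todoTitle = todo.title;
                v.todoState = todo.state;
                return v;
            });

        // Every join update is published to the view's compacted topic.
        view.toStream().to("todo-view", Produced.with(Serdes.Long(), viewSerde));
    }
}
```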
However, I'm not sure of a few things:

- Will this pattern actually compute the view I described and keep it up to date?
- Will updates from the two changelog topics be processed in the order (by timestamp) in which they occurred in the database?
- If a single database transaction updates both User and Todo, will the view be updated atomically, or can intermediate results show up on the topic?
Upvotes: 3
Views: 1842
Reputation: 62350
Yes. The pattern you describe will compute what you want out-of-the-box.
Streams processes records according to their timestamps (i.e., records with smaller timestamps first), so in general this will work as expected. However, there is no strict guarantee, because in stream processing it is more important to keep making progress than to block. Streams therefore applies only a best-effort approach to processing records in timestamp order: if one changelog does not provide any data, Streams simply keeps going and processes data from the other changelog alone (rather than blocking). This can lead to out-of-order processing with regard to timestamps across different partitions/topics.
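As an illustration, a sketch of how such an application might be configured (application id and bootstrap servers are placeholders); note that newer Streams versions (2.1+, KIP-353) add `max.task.idle.ms`, which trades latency for better, though still best-effort, timestamp ordering:

```java
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class TodoViewApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "todo-view-builder"); // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        // Kafka 2.1+ (KIP-353): let a task idle up to 1s when one input has no
        // buffered data, improving -- but still not guaranteeing -- timestamp order.
        props.put(StreamsConfig.MAX_TASK_IDLE_MS_CONFIG, 1000L);

        StreamsBuilder builder = new StreamsBuilder();
        // ... build the join topology from the question here ...
        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```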
That's not possible at the moment. Each update is processed individually, so you will see every intermediate (i.e., not yet committed) result. However, Kafka will introduce "transactional processing" in the future, which will make it possible to handle such transactions (see https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging and https://cwiki.apache.org/confluence/display/KAFKA/KIP-129%3A+Streams+Exactly-Once+Semantics).
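For reference, this work later shipped as Streams exactly-once semantics (Kafka 0.11+), enabled by a single config; a sketch, added to the properties from the earlier snippet (note it makes Streams' own read-process-write cycle atomic; it does not group view updates by the upstream database's transaction boundaries):

```java
// Added to the Properties built above; KIP-129 became "processing.guarantee".
// (Newer versions prefer StreamsConfig.EXACTLY_ONCE_V2.)
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
```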
Upvotes: 2