Reputation: 85
I learned about using Kafka's topics as a changelog to avoid doing synchronous RPC, but I don't understand how we deal with consistency as topics are not persistent (retention policy).
i.e, I run an application, 2 microservices:
Each service has its own db to persist the Users' data.. To communicate any changes made on a User's data, the confluent's teacher proposed to create a topic and use it as a changelog. User Service inputs the changes, other microservices can consume.
But What if:
The BillingService won't know the User X's address, so its view is inconsistent. Should I run a one-time "Call UserService to copy its full DB" when a new service enters the system? Seems ugly solution.
More tricky and challenging:
Therefore, it will potentially miss some changelogs. How do we deal with that? We are never confident how the service knows everything it has to know about the users.
Did some research, but found nothing. I really think I don't have enough vocabulary yet to do good research, as the problem sounds pretty common to everyone. Sorry if it exists a source dedicated to this problem that I did not find!
Upvotes: 0
Views: 375
Reputation: 1730
This is quite common issue in replication - a node goes offline for a significant amount of time. For example, a node's hardware completely failed/lost and it takes weeks to order/get new one.
In that case, in distributed systems, we don't do fail recovery, but we provision a new node as a replacement. That new node is completely empty, hence it needs some initial state.
If your queue has all events since the beginning of time, you could apply those events one by one to the node - that would do the job - but in a very inefficient way (imagine processing years of data).
There is a better process - first restore data for the new node from the most recent backup, and then reapply newer items.
Backing up data is important. Every Microservices should do its own job saving/restoring its data. As a result, the original Kafka system won't have to keep data forever.
As a quick summary: in distributed replication these are two different problems - catching up a lagging node and provisioning a new node. And if a node is lagging for a long time, then this becomes provisioning problem.
Upvotes: 0
Reputation: 20551
If the changelog topic is covering entities that are of unbounded lifetime (like your users, hopefully), that strongly suggests that the retention period for that topic should be infinite. Chances are that topic is sufficiently low volume that infinite retention is viable (consider that it can probably be partitioned).
If for some reason that's not viable, you can arrange for producers to at some period shorter than the retention period publish out "this is the state of this entity" for every entity they own to the topic. For entities which don't change very much, this is pretty wasteful and duplicative (but for those a very long to infinite retention period is more viable), for entities which do change a lot, this is a rounding error in terms of volume.
That neatly solves the first case and eventually allows for the second to be solved. For the second, there is basically no solution, which means that you have to choose the retention period for a topic such that you can guarantee that no consumer of this topic will ever be down (or not deployed) for longer than the retention period: this typically means that a retention period shorter than, say, 7 days, should be really heavily scrutinized. Note that if you have a 1 week retention period and a consumer has been down for more than a few days, you can temporarily bump up the retention period to buy you time for the consumer to get fixed, and if there's a consumer which can be down for more than a week without anybody noticing, how important is that consumer, really?
Upvotes: 1