Reputation: 4320

Seeding microservices databases

Given service A (CMS) that owns a model (Product; assume its only fields are id, title and price) and services B (Shipping) and C (Emails) that have to display that model, how should the model's information be synchronized across those services in an event sourcing approach? Assume that the product catalog changes rarely (but does change) and that admins access shipment and email data very often (example functionalities are: B: display the titles of the products an order contained, and C: display the content of the shipping email that is going to be sent). Each service has its own DB.

Solution 1

Send all required information about the Product within the event. This means the following structure for order_placed:

{
    order_id: [guid],
    product: {
        id: [guid],
        title: 'Foo',
        price: 1000
    }
}

On services B and C, product information is stored in a product JSON attribute on the orders table.

As such, only data retrieved from the event is used to display the necessary information.

Problems: depending on what other information needs to be presented in B and C, the amount of data in the event can grow. B and C might not require the same information about the Product, but the event will have to contain both sets (unless we split it into two events). If a piece of data is not present in an event, code cannot use it: if we add a color option to Product, then for existing orders in B and C the product will be colorless unless we update the events and rerun them.
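To make the trade-off concrete, here is a minimal sketch of how service B might consume order_placed under Solution 1. The table layout, the function names and the DEFAULT_COLOR fallback are assumptions made for illustration, not part of any existing service; SQLite stands in for B's own DB.

```python
import json
import sqlite3

# Hypothetical fallback for events recorded before `color` existed.
DEFAULT_COLOR = "black"


def create_orders_table(conn: sqlite3.Connection) -> None:
    # The product snapshot is kept as a JSON string on the order row.
    conn.execute(
        "CREATE TABLE orders (order_id TEXT PRIMARY KEY, product TEXT NOT NULL)"
    )


def handle_order_placed(conn: sqlite3.Connection, event: dict) -> None:
    # Store the product data exactly as it arrived in the event.
    conn.execute(
        "INSERT INTO orders (order_id, product) VALUES (?, ?)",
        (event["order_id"], json.dumps(event["product"])),
    )


def product_for_order(conn: sqlite3.Connection, order_id: str) -> dict:
    row = conn.execute(
        "SELECT product FROM orders WHERE order_id = ?", (order_id,)
    ).fetchone()
    if row is None:
        raise KeyError(order_id)
    product = json.loads(row[0])
    # Cater for old events that carry no color instead of rewriting history.
    product.setdefault("color", DEFAULT_COLOR)
    return product
```

Handling the missing field at read time avoids rewriting stored events, at the cost of baking an assumption about old products into the code.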

Solution 2

Send only the guid of the product within the event. This means the following structure for order_placed:

{
    order_id: [guid],
    product_id: [guid]
}

On services B and C, product information is stored in a product_id attribute on the orders table.

Product information is retrieved by services B and C when required by performing an API call to the A/product/[guid] endpoint.

Problems: this makes B and C dependent upon A (at all times). If the schema of Product changes on A, all services that depend on it have to be changed (suddenly).
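A rough sketch of what the lookup in Solution 2 could look like inside service B. The HTTP call is abstracted behind a fetch_product callable (a hypothetical wrapper around the A/product/[guid] endpoint), and the exception name is also an assumption; the point is only to show where the runtime dependency on A appears.

```python
from typing import Callable

# Hypothetical signature: takes a product guid and returns A's JSON for it,
# e.g. a thin wrapper around a GET to A/product/[guid].
ProductFetcher = Callable[[str], dict]


class ProductServiceUnavailable(Exception):
    """Raised when service A cannot be reached."""


def order_titles(orders: list[dict], fetch_product: ProductFetcher) -> dict[str, str]:
    """Map order_id to product title, asking A once per order."""
    titles: dict[str, str] = {}
    for order in orders:
        try:
            product = fetch_product(order["product_id"])
        except ConnectionError as exc:
            # B cannot display anything for this order while A is down.
            raise ProductServiceUnavailable(order["product_id"]) from exc
        titles[order["order_id"]] = product["title"]
    return titles
```

Every page that lists orders now costs one call to A per order, and fails if A does.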

Solution 3

Send only the guid of the product within the event. This means the following structure for order_placed:

{
    order_id: [guid],
    product_id: [guid]
}

On services B and C, product information is stored in a products table; there's still a product_id on the orders table, but products data is replicated between A, B and C; B and C might contain different information about Product than A.

Product information is seeded when services B and C are created and is updated whenever information about Products changes, either by calling the A/product endpoint (which returns the required information for all products) or by accessing A's DB directly and copying the product information the given service requires.

Problems: this makes B and C dependent upon A (when seeding). If the schema of Product changes on A, all services that depend on it have to be changed (when seeding).
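As an illustration of the seeding step in Solution 3, the sketch below copies only the fields service B needs into its own products table. The catalog argument stands for whatever the A/product endpoint returns; the table layout and function names are assumptions, and SQLite stands in for B's DB.

```python
import sqlite3
from typing import Iterable


def create_products_table(conn: sqlite3.Connection) -> None:
    # B keeps only what it displays: id and title, not price.
    conn.execute("CREATE TABLE products (id TEXT PRIMARY KEY, title TEXT NOT NULL)")


def seed_products(conn: sqlite3.Connection, catalog: Iterable[dict]) -> int:
    """Insert or overwrite local product rows; returns how many were written."""
    rows = [(p["id"], p["title"]) for p in catalog]
    # INSERT OR REPLACE makes the seed safe to rerun after catalog changes.
    conn.executemany(
        "INSERT OR REPLACE INTO products (id, title) VALUES (?, ?)", rows
    )
    conn.commit()
    return len(rows)
```

Because the seed is idempotent, the same function can serve both the initial load and a later refresh.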

From my understanding, the correct approach would be to go with solution 1, and either update the event history according to some logic (if the Product catalog hasn't changed and we want to display color, we can safely update the history using the current state of Products to fill in the missing data within the events) or cater for the absence of the data (if the Product catalog has changed and we want to display color, we can't be sure whether a given Product had a color at that point in the past; we can assume that all Products in the previous catalog were black and cater for that by updating the events or the code).

Upvotes: 12

Answers (4)

Sudhir

Reputation: 1419

It is very hard to say simply that one solution is better than another. Choosing between Solutions #2 and #3 depends on other factors (cache duration, consistency tolerance, ...)

My 2 cents:

Cache invalidation might be hard, but the problem statement mentions that the product catalog changes rarely. This fact makes product data a good candidate for caching.

Solution #1 (NOK)

Data is duplicated across multiple systems

Solution #2 (OK)

Offers strong consistency
Works only when product service is highly available and offers good performance
If the email service prepares a summary (with a lot of products), then the overall response time could be longer

Solution #3 (Complex but preferred)

Prefer API approach instead of direct DB access to retrieve product information
Resilient - consuming services are not impacted when product service is down
Consuming applications (shipping and email services) retrieve product details immediately after an event is published. The possibility of the product service going down within these few milliseconds is very remote.

Upvotes: 2

Phil Sandler

Reputation: 28016

There are two hard things in Computer Science, and one of them is cache invalidation.

Solution 2 is absolutely my default position, and you should generally only consider implementing caching if you run into one of the following scenarios:

1. The API call to Service A is causing performance problems.
2. The cost of Service A being down and being unable to retrieve the data is significant to the business.

Performance problems are really the main driver. There are many ways of solving #2 that don't involve caching, like ensuring Service A is highly available.

Caching adds significant complexity to a system, and can create edge cases that are hard to reason about, and bugs that are very hard to replicate. You also have to mitigate the risk of providing stale data when newer data exists, which can be much worse from a business perspective than (for example) displaying a message that "Service A is down--please try again later."

From this excellent article by Udi Dahan:

These dependencies creep up on you slowly, tying your shoelaces together, gradually slowing down the pace of development, undermining the stability of your codebase where changes to one part of the system break other parts. It’s a slow death by a thousand cuts, and as a result nobody is exactly sure what big decision we made that caused everything to go so bad.

Also, if you need point-in-time querying of product data, this should be handled in the way the data is stored in the Product database (e.g. start/end dates) and should be clearly exposed in the API (the effective date needs to be an input to the API call that queries the data).
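To illustrate the start/end date idea, here is a small sketch of a point-in-time lookup as Service A might implement it. The ProductVersion type, its field names and the half-open validity interval are assumptions chosen for the example, not a description of any particular schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ProductVersion:
    product_id: str
    title: str
    price: int
    valid_from: date          # inclusive
    valid_to: Optional[date]  # exclusive; None means "still current"


def product_as_of(
    versions: list[ProductVersion], product_id: str, effective: date
) -> Optional[ProductVersion]:
    """Return the version of the product that was valid on `effective`."""
    for v in versions:
        if v.product_id != product_id:
            continue
        if v.valid_from <= effective and (v.valid_to is None or effective < v.valid_to):
            return v
    return None  # the product did not exist on that date
```

The effective date is an explicit input, so callers such as Shipping can ask for the product as it was when the order was placed.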

Upvotes: 2

Savvas Kleanthous

Reputation: 2745

Generally speaking, I'd strongly recommend against option 2 because of the temporal coupling between those services (unless communication between them is super stable, and not very frequent). Temporal coupling is what you describe as "this makes B and C dependent upon A (at all times)", and means that if A is down or unreachable from B or C, B and C cannot fulfill their function.

I personally believe that both options 1 and 3 have situations where they are valid options.

If the volume of communication between A and B & C is high, or the amount of data needed in the event is large enough to be a concern, then option 3 is the best option, because the burden on the network is much lower, and the latency of operations will decrease as the message size decreases. Other concerns to consider here are:

Stability of contract: if the contract of the message leaving A changed often, then putting a lot of properties in the message would result in lots of changes in consumers. However, in this case I believe this is not a big problem because:
1. You mentioned that system A is a CMS. This means that you're working on a stable domain, and as such I don't believe you'll be seeing frequent changes.
2. Since B and C are shipping and email, and you're receiving data from A, I believe you'll be experiencing additive changes instead of breaking ones, which are safe to add whenever you discover them, with no rework.
Coupling: There is very little to no coupling here. Since the communication is via messages, there is no coupling between the services other than a short temporal one during seeding of the data, and the contract of that operation (which is not a coupling you can or should try to avoid).

Option 1 is not something I'd dismiss though. There is the same amount of coupling, but development-wise it should be easy to do (no need for special actions), and stability of the domain should mean that these won't change often (as I mentioned already).

Another option I'd suggest is a slight variation on 3, which is not to run the process during start-up, but instead to observe "ProductAdded" and "ProductDetailsChanged" events on B and C whenever there is a change in the product catalogue in A. This would make your deployments faster (and so make it easier to fix a problem/bug if you find any).
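A minimal sketch of that variation, assuming each event is a dict with type, product_id and details keys (that shape and the ProductReadModel name are illustrative, not a defined contract). Each consuming service keeps its own copy and updates it as events arrive:

```python
class ProductReadModel:
    """Local copy of the product fields this service needs."""

    def __init__(self) -> None:
        self._products: dict[str, dict] = {}

    def apply(self, event: dict) -> None:
        kind = event["type"]
        if kind == "ProductAdded":
            self._products[event["product_id"]] = dict(event["details"])
        elif kind == "ProductDetailsChanged":
            # Merge only the changed fields; if ProductAdded was never seen,
            # this starts the record from the changed fields alone.
            self._products.setdefault(event["product_id"], {}).update(event["details"])
        # Any other event type is ignored by this read model.

    def title(self, product_id: str) -> str:
        return self._products[product_id]["title"]
```

B and C can each apply the same events but keep different subsets of the details.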

Edit 2020-03-03

I have a specific order of priorities when determining the integration approach:

What is the cost of consistency? Can we accept some milliseconds of inconsistency between things changed in A and them being reflected in B & C?
Do you need point-in-time queries (also called temporal queries)?
Is there any source of truth for the data? A service which owns them and is considered upstream?
If there is an owner / single source of truth, is it stable? Or do we expect to see frequent breaking changes?

If the cost of inconsistency is high (basically, the product data in A needs to be consistent as soon as possible with the product data cached in B and C), then you cannot avoid accepting unavailability and making a synchronous request (like a web/REST request) from B & C to A to fetch the data. Be aware! This still does not mean transactionally consistent; it just minimizes the window for inconsistency. If you absolutely, positively have to be immediately consistent, you need to rethink your service boundaries. However, I very strongly believe this should not be a problem. From experience, it's actually extremely rare that a company can't accept some seconds of inconsistency, so you shouldn't even need to make synchronous requests.

If you do need point-in-time queries (which I didn't notice in your question and hence didn't include above, maybe wrongly), the cost of maintaining this on downstream services is so high (you'd need to duplicate internal event projection logic in all downstream services) that it makes the decision clear: you should leave ownership to A and query A ad hoc over a web request (or similar), and A should use event sourcing to retrieve all the events known at the time, project them to the state, and return it. I guess this may be option 2 (if I understood correctly?), but the costs are such that temporal coupling is better than the maintenance cost of duplicated events and projection logic.

If you don't need point-in-time queries, and there isn't a clear, single owner of the data (in my initial answer I assumed there was one, based on your question), then a very reasonable pattern would be to hold representations of the product in each service separately. When you update the data for products, you update A, B and C in parallel by making parallel web requests to each one, or you have a command API which sends commands to each of A, B and C. B & C use their local version of the data to do their job, which may or may not be stale. This isn't any of the options above (although it could be made to be close to option 3), as data in A, B and C may differ, and the "whole" of the product may be a composition of all three data sources.

Knowing whether the source of truth has a stable contract is useful because it tells you whether you can use the domain/internal events (or the events you store when using event sourcing as the storage pattern in A) for integration between A and services B and C. If the contract is stable you can integrate through the domain events. However, you then have an additional concern if changes are frequent, or if the message contract is large enough to make transport a concern.

If you have a clear owner, with a contract that is expected to be stable, the best option would be option 1; an order would contain all the necessary information, and B and C would do their job using the data in the event.

If the contract is liable to change, or break often, following your option 3, that is, falling back to web requests to fetch product data, is actually a better option, since it's much easier to maintain multiple versions. So B would make a request on v3 of product.

Upvotes: 2

VoiceOfUnreason

Reputation: 57257

Solution #3 is really close to the right idea.

A way to think about this: B and C are each caching "local" copies of the data that they need. Messages processed at B (and likewise at C) use the locally cached information. Likewise, reports are produced using the locally cached information.

The data is replicated from the source to the caches via a stable API. B and C don't even need to be using the same API - they use whatever fetch protocol is appropriate for their needs. In effect, we define a contract -- protocol and message schema -- which constrain the provider and the consumer. Then any consumer for that contract can be connected to any supplier. Backward incompatible changes require a new contract.

Services choose the appropriate cache invalidation strategy for their needs. This might mean pulling changes from the source on a regular schedule, or in response to a notification that things may have changed, or even "on demand" -- acting as a read through cache, falling back to the stored copy of the data when the source is not available.
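The "on demand" variant can be sketched roughly as follows. The class name and the fetch_product callable (a stand-in for whatever fetch protocol the service uses against the source) are assumptions for illustration; a real service would persist the stored copy in its own database rather than in memory.

```python
from typing import Callable


class ReadThroughProductCache:
    """Ask the source first; fall back to the stored copy if it is unreachable."""

    def __init__(self, fetch_product: Callable[[str], dict]) -> None:
        self._fetch = fetch_product
        self._stored: dict[str, dict] = {}

    def get(self, product_id: str) -> dict:
        try:
            product = self._fetch(product_id)
        except ConnectionError:
            if product_id in self._stored:
                # Source is down: serve the last known (possibly stale) copy.
                return self._stored[product_id]
            raise  # nothing stored yet, so the failure has to surface
        self._stored[product_id] = product
        return product
```

The fallback only helps for products that were fetched at least once before the outage, which is one reason to combine it with scheduled or notification-driven refreshes.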

This gives you "autonomy", in the sense that B and C can continue to deliver business value when A is temporarily unavailable.

Recommended reading: Data on the Outside versus Data on the Inside, Pat Helland, 2005.

Upvotes: 5
