Mike Rowe Servis

Reputation: 81

Microservices and the "single point of failure" concept

One concept I don't entirely understand is the single point of failure. It seems to me that whenever you have multiple services, say A, B and C, involved in an entire system, then if any of them is down the system as a whole can't do anything that useful (If the system could be useful without B, then why is B even needed in the first place?).

For example, let's say we have a pipeline such that A publishes an event that is consumed by B and then B publishes a message that is consumed by C and this flow of data is how the whole system serves its purpose.

A ===> B ===> C

Maybe C is the service that processes credit card information: the business isn't really running if no money is coming in!

Since this is a messaging system, these services are "independent" in the sense that if one goes down it does not cause another to go down. OK, but if B goes down then C won't receive any new messages and the entire system isn't serving its purpose. So, what difference does it make having separate services A, B and C rather than one service ABC?
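To make the scenario concrete, here is a toy sketch of the pipeline (plain in-memory queues standing in for a real broker such as Kafka or RabbitMQ; the names and the doubling "business logic" are made up). Note that while B is down the system makes no progress, but with durable messaging no work is lost, it just waits:

```python
from queue import Queue

# In-memory stand-ins for broker queues/topics between the services.
a_to_b = Queue()
b_to_c = Queue()

def service_a(events):
    for e in events:
        a_to_b.put(e)                 # A publishes regardless of B's state

def service_b():
    while not a_to_b.empty():
        b_to_c.put(a_to_b.get() * 2)  # B transforms and forwards

def service_c():
    results = []
    while not b_to_c.empty():
        results.append(b_to_c.get())
    return results

service_a([1, 2, 3])            # B is "down": nothing is consumed yet
assert service_c() == []        # C sees nothing, but...
assert a_to_b.qsize() == 3      # ...no events are lost; they wait for B
service_b()                     # B recovers and catches up
assert service_c() == [2, 4, 6]
```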

Upvotes: 7

Views: 4287

Answers (5)

Christian Findlay

Reputation: 7682

I think your question is rhetorical. Obviously, if the system depends on all services, then any service is a single point of failure. If a single service goes down, the system won't be "serving its purpose". Embracing microservices won't automatically liberate you from the problem of a single point of failure.

Most proponents of microservices will tell you that you should design your system in such a way that the whole does not depend on any one service. But such a system just sounds like a unicorn to me. It's the same as saying "if you delete a chunk of your code, the app should keep working".

In reality, you can design a system that retains some utility if any one service goes down. But I doubt there is any system that can function properly when one of its important components is missing. And when a system does function properly without one of its components, the amount of extra error handling required to get there is horrendous.

But the thing is, that's not what Microservices are designed for. That's just one of the supposed benefits that people tout. The benefit only comes if you design your system to allow for the failure. But, you don't need to use Microservices to do that anyway.

Building occasionally connected clients is another way to avoid a single point of failure. Git is a good example. If GitHub goes down, you don't have people sitting around saying "oh, looks like I won't have to do any work today then".

Note: a load balancer can be thrown in front of any service so that the physical machine doesn't become the single point of failure.
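As a rough illustration of that note, a load balancer's core job can be sketched as round-robin over healthy backends (a toy model, not any real load balancer's API; backend names and the `down` set are made up):

```python
import itertools

class RoundRobinBalancer:
    """Toy round-robin balancer: spreads calls over healthy backends so
    no single machine is the failure point."""
    def __init__(self, backends):
        self.backends = list(backends)
        self._cycle = itertools.cycle(range(len(self.backends)))
        self.down = set()   # indices of backends failing health checks

    def pick(self):
        # Skip unhealthy backends; give up only if all are down.
        for _ in range(len(self.backends)):
            i = next(self._cycle)
            if i not in self.down:
                return self.backends[i]
        raise RuntimeError("no healthy backends")

lb = RoundRobinBalancer(["node-a", "node-b", "node-c"])
assert [lb.pick() for _ in range(3)] == ["node-a", "node-b", "node-c"]
lb.down.add(1)   # node-b fails its health check
assert "node-b" not in {lb.pick() for _ in range(6)}  # traffic flows on
```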

Upvotes: 1

James Russell

Reputation: 66

Service composition is one of the most difficult parts of microservices. Short of reading a few books on the subject, here are a few guidelines. If you are looking for some of the benefits below, breaking out into an independent service might make sense:

  1. Re-use logic. In your example, suppose service C is also called by other services: D -> E -> C. If you suspect your logic could or should be consumed by other services, creating C as an independent service means you can serve consumers of that logic even when A and B are down.
  2. De-couple teams. If you are a small team, you probably don't want hundreds of services. But, write your logic so that you can separate it later. It's good to have a "microservice" mindset even if you don't break your logic into independent running services in the beginning.
  3. Ensure logic separation. It's easy to cheat when your code is in a monolith. If you really need to force yourself or your team to treat two things as separate so that they can be reused, a separate service can enforce this.
  4. Optimize for execution patterns. If your system has to respond synchronously, do work asynchronously, or deal with huge surges of inbound work (from 0 rps to 10,000 rps in minutes), you may want to separate out services. A lightweight service that only takes a REST API call and queues it for actual work might be exactly what you need to handle inbound floods where you can afford to process or respond asynchronously. The lightweight service can spin up in milliseconds, providing a quick response to erratic demand. If you have a 12 GB Java beast, scaling up could take a while.
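Point 4 can be sketched with a queue between a fast "front door" and a slower worker (an in-process toy using Python threads; in production the queue would be a real broker and the front door a REST handler returning 202 Accepted; all names here are illustrative):

```python
import queue
import threading

work_queue = queue.Queue()
processed = []

def ingest(payload):
    """Lightweight front door: validate, enqueue, respond immediately."""
    if not isinstance(payload, dict):
        raise ValueError("bad payload")
    work_queue.put(payload)
    return {"status": "accepted"}   # fast response even under a load spike

def worker():
    # Heavyweight back end drains the queue at its own pace.
    while True:
        item = work_queue.get()
        if item is None:            # sentinel: shut down
            break
        processed.append(item["id"])
        work_queue.task_done()

t = threading.Thread(target=worker)
t.start()
for i in range(5):
    ingest({"id": i})   # burst of inbound work, all accepted instantly
work_queue.join()       # wait for the slow side to catch up
work_queue.put(None)
t.join()
assert processed == [0, 1, 2, 3, 4]
```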

I would also recommend that you choose your datastore wisely. A lot of time can be spent optimizing the reliability of the coded services and the accompanying infrastructure, but you can still have a single point of failure in your database architecture (or network, or load balancer, or DNS, or...).

Upvotes: 0

arunvg

Reputation: 1269

Slightly modify the system and add redundancy.

[(A)(A)(A)] ===> [(B)(B)(B)] ===> [(C)(C)(C)]

Now even if one of the replicated services, say a (B) node, goes down, the user story still completes thanks to the remaining clone (B) nodes.

This system (in this scope) doesn't have a single point of failure.
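The pattern behind this is competing consumers: the replicas pull from one shared queue, so a dead replica simply stops taking work while the others absorb it. A toy simulation (replica IDs and the alive/down flags are illustrative):

```python
from queue import Queue

inbox = Queue()
for job in range(6):
    inbox.put(job)

# Three replicas of service B competing on one shared queue.
# Marking replica 1 as down simulates a node failure.
replicas = {0: True, 1: False, 2: True}
completed = []

while not inbox.empty():
    for rid, alive in replicas.items():
        if alive and not inbox.empty():
            completed.append((rid, inbox.get()))

assert sorted(j for _, j in completed) == [0, 1, 2, 3, 4, 5]  # all jobs done
assert all(rid != 1 for rid, _ in completed)  # the dead replica did nothing
```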

Note that because your design used messaging, i.e. the services are loosely coupled, it was very easy to modify the system and remove failure points.

There are other aspects of microservices which would need a detailed discussion. A perspective that helped me understand the concepts behind microservices is the Scale Cube model.

Upvotes: 1

Rob Conklin

Reputation: 9446

Different parts of the system have different availability needs. Failure-mode analysis is critical to really understand where you need to separate services and make them more resilient. For example, perhaps C isn't worth separating out if it is mandatory in the ordering workflow; but because it is so important, it should get its own additional resiliency (multiple fail-over workers on multiple hosts).

If, on the other hand, C were a fulfillment system (sending the pick-ticket to the warehouse), it wouldn't need that level of resiliency, and could afford to go down. It's about deciding where your failure points are, and how much it's worth to prevent those failures.

In addition to failure modes, there are capacity issues to consider. Credit-card processing may have completely different scaling needs than an inventory listing service. Perhaps customers are asking for prices VERY frequently, so you need to support much more capacity there than for the credit-card processing service, and you need to build more scaling capacity into that part of the system. Also, a failure in that service may be more acceptable than a failure in the actual order processing service (revenue that is likely vs speculative). Regardless, you need to understand the value of each of these services, and find ways to split them that let you scale their capacity and resiliency independently.

Upvotes: 0

Tobin

Reputation: 2018

Think of a service ('B') as a collection of parallel roads, or processing channels. Once these roads are built to a design (the code) they sit there operating. The design doesn't change, so the processing doesn't change, and it behaves as you say. However, suppose a road develops a non-design fault, a hardware failure: the road surface is physically impassable. Traffic cannot flow on it, but luckily we have many parallel roads which can absorb that traffic! If we only had one (wide) road, the whole road would be shut for resurfacing and no traffic could flow.

You can take this further. Imagine traffic on your parallel roads is increasing and the roads are at capacity. It is easy to build another single lane road. This isn't much, but once it is built you can allow it to operate at maximum capacity. But the land rent costs money! So when traffic is reduced, we can easily decommission the small road and not pay rent on it.

You can take this even further - say you come up with a new road design, so you build it next to the existing roads. You can allow traffic onto this road and test how it operates. If there is an unknown error in your design, some traffic might get lost. But most of the traffic can go through your existing good roads. Now we can either change the design, or keep it and slowly change each small road into the new design.
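That last idea is essentially a canary rollout, which can be sketched as weighted routing between the old and new handlers (the handler names, version tags, and the 10% weight are all illustrative, not any particular platform's API):

```python
import random

def old_handler(req):
    # The proven design: the existing roads.
    return ("v1", req * 2)

def new_handler(req):
    # The candidate design under test: the new road.
    return ("v2", req * 2)

def route(req, canary_weight=0.1, rng=random.random):
    # Send a small slice of traffic down the new road; the rest
    # takes the proven path.
    handler = new_handler if rng() < canary_weight else old_handler
    return handler(req)

assert route(3, rng=lambda: 0.05) == ("v2", 6)   # canary traffic
assert route(3, rng=lambda: 0.50) == ("v1", 6)   # bulk traffic
```

If the new handler misbehaves, only the canary slice of traffic is affected; dialing `canary_weight` to 0 or 1 rolls the change back or fully out.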

Upvotes: 0
