Designing a Scalable Microservices Architecture for Healthcare Applications

I'm building a hospital management system that spans medical functionalities (e.g., RIS, LIS, PACS) and administrative domains such as billing, maintenance, inventory, pharmacy, and human resources. It’s intended to be an all-inclusive hospital management application.

Current Setup

  1. Frontend:
    • Built with Next.js for user interaction.
    • Sends HTTP requests to an API Gateway.
  2. Backend Gateway:
    • Developed using the NestJS framework.
    • Receives requests from the frontend and routes them to microservices using TCP message patterns.
  3. Microservices:
    • Each microservice is responsible for a specific functionality. Here's the list of my microservices:
      • Gateway Service: Handles routing of requests.
      • Auth Service: Manages authentication and security codes.
      • Communications Service: Sends emails and notifications.
      • Examinations Service: Manages medical examinations and test results.
      • Billing Service: Handles financial records and invoicing.
      • Pharmacy Service: Manages inventory and prescriptions for the hospital pharmacy.
      • Hospitalization Service: Manages patient admissions and room allocations.
      • Doctors Service: Handles doctor-related operations and schedules.
      • Patients Service: Manages patient data and interactions with external registries.
      • Personnel Service: Manages hospital staff records.
      • Users Service: Handles user management and roles.
      • Common Service: Provides reusable functionality and utilities shared by other services.

Challenges

  1. Inter-Service Dependencies:
    • Some endpoints in my microservices rely heavily on other microservices. For example:
      • The Auth Service uses the Communications Service to send emails for security codes.
      • If the Communications Service is down, parts of the Auth Service functionality are also unavailable.
    • This creates a situation where if one service (Z) fails, other dependent services (R) also fail, making the system behave like a monolithic application where everything is either up or down.
  2. Cascading Failures:
    • Tight coupling between services leads to cascading failures, where downtime in one service propagates to others.
  3. Fault Tolerance:
    • I need to ensure that service unavailability doesn’t affect unrelated functionalities. For instance:
      • If the Communications Service is down, users should still be able to authenticate, even if email notifications are delayed or unavailable.
  4. Why Microservices?
    • My primary goal is to segment functionality to reduce redundancy and duplication across the system. However, the current implementation seems to lack independence, resulting in cascading issues.

Questions

  1. How can I decouple inter-service dependencies while maintaining fault tolerance?
  2. Would an event-driven architecture or patterns like Saga/Choreography help manage service interactions and reduce cascading failures? If so, what’s the best approach for implementation?
  3. How can I handle service unavailability gracefully, especially when some functionalities (like email notifications) are non-critical for the workflow?
  4. Are there best practices for refactoring a tightly coupled microservices system without regressing to a monolithic design?

I’d greatly appreciate any advice on how to improve the architecture to achieve scalability, fault tolerance, and maintain the benefits of segmentation with minimal duplication. Let me know if more details are needed!

Upvotes: 0

Views: 61

Answers (1)

Christophe Quintard

Reputation: 2743

First, I will nitpick a little: you are doing service-oriented architecture (SOA), not microservices.

Now, about your system: my first piece of advice is not to ask questions about your SOA that you would not ask about a monolithic architecture.

Let's take the communication service as an example. If you were building a monolith, you would still have to worry about emails or notifications not being sent, but you would not worry about other modules being unable to reach the communication module. I recommend you think the same way about your SOA.

Work out scenarios for the cases where you cannot send communications for reasons outside your control: your mail provider is down, the network fails, and so on. You can design a degraded mode, use two different providers to reduce the risk, or choose to live with it because the odds are low.
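To make the degraded mode concrete, here is a minimal TypeScript sketch. It tries each mail provider in turn and, if all of them fail, parks the message in an outbox instead of failing the caller. `MailProvider`, `Mailer` and the in-memory outbox are names I made up for illustration; they are not part of NestJS or any mail library, and a real outbox would need to be persisted.

```typescript
// Hypothetical abstraction over a mail provider (not a real library API).
interface MailProvider {
  name: string;
  send(to: string, body: string): Promise<void>;
}

interface PendingMail {
  to: string;
  body: string;
}

type SendResult =
  | { status: "sent"; provider: string }
  | { status: "queued" }; // degraded mode: delivery is postponed

class Mailer {
  // In-memory outbox for the sketch only; persist it in a real system.
  readonly outbox: PendingMail[] = [];
  private readonly providers: MailProvider[];

  constructor(providers: MailProvider[]) {
    this.providers = providers;
  }

  async send(to: string, body: string): Promise<SendResult> {
    for (const provider of this.providers) {
      try {
        await provider.send(to, body);
        return { status: "sent", provider: provider.name };
      } catch {
        // This provider failed; fall through to the next one.
      }
    }
    // Every provider failed: keep the mail for a later retry
    // and let the calling workflow continue.
    this.outbox.push({ to, body });
    return { status: "queued" };
  }
}
```

With something like this, the Auth Service can tell the user "your code will arrive shortly" when it gets `queued` back, rather than returning an error.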

With SOA, there is a risk that one service cannot reach another, which does not exist in a monolith. But there are plenty of ways to make that very unlikely:

  • run several instances of the communication service,
  • run instances of the communication service in different clusters / zones / countries,
  • have the other services periodically check whether they can reach the communication service, so that failures are detected as early as possible (checking that A is up and B is up is not the same as checking that A can reach B).
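The last point can be sketched roughly as follows. The idea is that the calling service (A) runs the probe itself, over the same path its real requests take, and applies a timeout so a hung connection counts as a failure. `ReachabilityMonitor`, `withTimeout` and the default values are my own illustrative choices, not an existing API.

```typescript
// A probe is any cheap call that A makes to B over the real request path.
type Probe = () => Promise<void>;

// Reject if the promise does not settle within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${ms} ms`)),
      ms,
    );
    promise.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}

class ReachabilityMonitor {
  private consecutiveFailures = 0;
  private readonly probe: Probe;
  private readonly timeoutMs: number;
  private readonly failureThreshold: number;

  constructor(probe: Probe, timeoutMs = 1000, failureThreshold = 3) {
    this.probe = probe;
    this.timeoutMs = timeoutMs;
    this.failureThreshold = failureThreshold;
  }

  // Run one check; schedule this periodically (e.g. with setInterval).
  async check(): Promise<boolean> {
    try {
      await withTimeout(this.probe(), this.timeoutMs);
      this.consecutiveFailures = 0;
    } catch {
      this.consecutiveFailures += 1;
    }
    return this.isReachable();
  }

  // The dependency counts as unreachable after N consecutive failures.
  isReachable(): boolean {
    return this.consecutiveFailures < this.failureThreshold;
  }
}
```

The threshold avoids flapping on a single lost packet. What you do when `isReachable()` turns false (raise an alert, switch to a degraded mode) is up to you.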

If you run all your services in Kubernetes, the chances that services cannot communicate with each other are low. If a node or the network goes down, the cluster removes the faulty or unreachable node, eventually starts new pods to compensate for the lost ones, and your system as a whole continues to run just the same (service availability > server availability).

If I may offer one more piece of advice: pay close attention to "transactions". Do not use a distributed transaction manager; that is hell. I suggest:

  • Use API calls, but be ready to fail the entire request if any part fails (sometimes this is impossible to achieve).
  • Communicate through messages. Messages can be retried: combine at-least-once delivery with idempotent handlers to get effectively exactly-once processing, which makes your system reliable.
  • Prefer declarative over imperative. If you process a command that requires three distinct operations, it is hard to guarantee that you perform either all three or none. If you save a demand instead, you can retry the three operations again and again until all of them have completed, then mark the demand as complete. And if you want to roll back, the demand records the operations that have to be undone, and once again you can retry several times. Or a human can do it.
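Here is a rough TypeScript sketch of the saved-demand idea. The `Demand` shape, `processDemand` and the operation names in the usage note are invented for illustration. I also assume operations must run in order, so processing stops at the first failure and resumes from there on the next attempt.

```typescript
// Hypothetical shape of a saved demand; all names are illustrative.
interface Demand {
  id: string;
  pending: string[]; // operations still to run, in order
  completed: string[]; // operations already done
  status: "open" | "complete";
}

// Each operation receives the demand id, which it can use as an
// idempotency key on the receiving side.
type Operation = (demandId: string) => Promise<void>;

// Try the pending operations once. Safe to call again and again:
// operations recorded as completed are never run a second time.
async function processDemand(
  demand: Demand,
  operations: Record<string, Operation>,
): Promise<Demand> {
  for (const name of [...demand.pending]) {
    const operation = operations[name];
    if (!operation) {
      throw new Error(`unknown operation: ${name}`);
    }
    try {
      await operation(demand.id);
    } catch {
      // Stop here; a later retry resumes from this operation.
      break;
    }
    // A real system would persist the demand after each step.
    demand.completed.push(name);
    demand.pending = demand.pending.filter((n) => n !== name);
  }
  if (demand.pending.length === 0) {
    demand.status = "complete";
  }
  return demand;
}
```

For example, an admission demand with `reserveBed`, `createInvoice` and `notifyPatient` stays open while billing is down and completes on a later retry, without reserving the bed twice. One caveat: if the process crashes after an operation succeeded but before that was recorded, the operation runs again on retry, which is why the operations themselves should be idempotent.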

Hope this helps. Good luck with your project, it is really ambitious!

Upvotes: 1
