What platform/tech stack can help achieve seamless distributed logging and tracing for my system?
I’m looking for recommendations on platforms or tech stacks that can help us achieve robust distributed logging and tracing for our platform. Here's an overview of our system and requirements:
Our Platform
We have a distributed system with the following interconnected components:
- Web App built using Next.js:
- Frontend: React
- Backend: Node.js
- REST API Server using FastAPI.
- Python Library that runs on client machines and interacts with the REST API server.
What We Want to Achieve
When users report issues, we need a setup that can:
- Trace user activities across all components of the platform.
- Correlate logs from different parts of the system to identify and troubleshoot the root cause of an issue.
For example, if a user encounters a REST API error while using our Python library, we want to trace the entire flow of that request across the Python library, REST API server, and any related services.
Specific Questions
Tracking User Actions Across the Platform
- Are there any tools or platforms that can trace a user’s journey or timeline of activities across multiple applications?
- Can these tools link logs/errors to an individual user’s actions across the system?
Handling Guest Users and Identity Mapping
- Many users interact as guests (anonymous) before authenticating. Is there a way to associate logs/errors from their guest activities to their identity once they log in, so all their past and future actions are unified under a single identity?
Unifying Logs Across the Platform
Here’s an example scenario we’re looking to address:
- A user runs code in the Python library, and we log their actions.
- The library prompts them to log in, and we log this event as well.
- During login, they hit a REST API endpoint, but the login fails (e.g., an authentication error). Logs are captured from both the library and the API server.
- Upon successful login, we assign an identity to the user and tag all their future logs with it.
- The user uploads a file via the Python library, but the REST API server throws an error. Logs from both the library and the API server should be correlated.
- Later, if the user reports an issue, we want to trace their actions across the entire platform to identify the root cause.
Filtering Logs for Troubleshooting
- Are there solutions that allow filtering logs/traces for a specific user or session to recreate the sequence of events leading to an issue?
What We Are Considering
Are there platforms, open-source tools, or tech stack setups (commercial or otherwise) that you’d recommend for this?
We’re essentially looking for a distributed logging and tracing solution that can help us achieve this level of traceability and troubleshooting across our platform.
Would love to hear about what has worked for you or any recommendations you might have!
Thanks in advance!