Reputation: 896
I was wondering about possible ways to track down performance bottlenecks in distributed systems. I am aware of tools like X-Trace and its offspring (e.g. Dapper), but I am more curious about the methodology than about specific tools.
In other words, given a distributed system without any obvious bottlenecks, how do you study and improve its performance?
Upvotes: 5
Views: 1778
Reputation: 626
Honestly, that's a great question, and there is no consensus on the best way to do this. One of the most basic approaches is logging: you dump system events into a file, then parse the logs to measure the time between events and figure out how long each step takes. Another approach is tracing (which is what X-Trace does). In tracing, you follow a request through its whole lifetime. For example, if you send a request to a system built on a microservice architecture, you record the thread, process ID, and latency of the request as it passes through each microservice.
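To make the logging approach concrete, here is a minimal sketch of parsing logs to get per-service timings. The log format (ISO timestamp followed by `req=`, `service=`, and `event=` fields), the sample lines, and the function names are all assumptions made up for illustration, not the output of any particular tool.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical log lines: timestamp, request ID, service name, start/end marker.
LOG_LINES = [
    "2024-01-01T12:00:00.000 req=r1 service=gateway event=start",
    "2024-01-01T12:00:00.010 req=r1 service=auth event=start",
    "2024-01-01T12:00:00.050 req=r1 service=auth event=end",
    "2024-01-01T12:00:00.120 req=r1 service=gateway event=end",
]


def parse_line(line):
    """Split one log line into a dict with a parsed 'time' field."""
    timestamp, *fields = line.split()
    record = dict(field.split("=", 1) for field in fields)
    record["time"] = datetime.fromisoformat(timestamp)
    return record


def span_durations(lines):
    """Return {request_id: {service: duration_ms}} from start/end pairs."""
    starts = {}
    durations = defaultdict(dict)
    records = sorted((parse_line(l) for l in lines), key=lambda r: r["time"])
    for record in records:
        key = (record["req"], record["service"])
        if record["event"] == "start":
            starts[key] = record["time"]
        elif record["event"] == "end" and key in starts:
            elapsed = record["time"] - starts.pop(key)
            # Dividing two timedeltas yields a float number of milliseconds.
            durations[record["req"]][record["service"]] = elapsed / timedelta(milliseconds=1)
    return durations


if __name__ == "__main__":
    for req, services in span_durations(LOG_LINES).items():
        for service, ms in services.items():
            print(f"{req} {service}: {ms:.1f} ms")
```

This only works if every service logs a shared request ID, which is essentially what tracing systems automate for you.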
The tricky part is figuring out what to keep track of in the trace of a request, and that depends on what your distributed system is trying to accomplish. For example, an obvious metric of interest for performance is latency, so you measure how long the request spent in each service. Another metric that could be interesting is contention, so you can measure CPU contention while the request was going through the system. One of the problems with many of these profiling tools is that they give you overall metrics for the system or for a request, but to find a performance issue you need to know whether a request is an outlier. It is thus essential to compare the latency, contention, and memory consumption of a request with those of other similar requests in the system to figure out whether it is abnormal.
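As a rough sketch of comparing a request against similar requests, the snippet below flags latency outliers with a median-based modified z-score. The choice of statistic, the 3.5 threshold, the sample numbers, and the name `find_outliers` are my own assumptions for illustration; the same idea applies to contention or memory figures.

```python
import statistics


def find_outliers(latencies_ms, threshold=3.5):
    """Return IDs of requests whose latency is far from the group median.

    Uses the modified z-score: 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation. It is less sensitive to the outliers themselves
    than a mean/standard-deviation test.
    """
    values = list(latencies_ms.values())
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        # All requests look (nearly) identical; nothing to compare against.
        return []
    return [
        req
        for req, value in latencies_ms.items()
        if 0.6745 * abs(value - median) / mad > threshold
    ]


if __name__ == "__main__":
    # Hypothetical latencies for requests of the same type.
    similar_requests = {"r1": 100, "r2": 105, "r3": 98, "r4": 102, "r5": 480}
    print(find_outliers(similar_requests))
```

The important part is the grouping: only compare requests that hit the same endpoint with comparable payloads, otherwise a legitimately expensive request type will look abnormal.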
Upvotes: 0
Reputation: 40709
I've used a method that has a pro and a con. The pro is that it works: it finds problems that, when fixed, result in nice snappy performance. The con is that it's a good amount of manual work.
I even wrote a book, and included the method. The work is to collect time-stamped event logs and merge them together into a common timeline. Then you carefully examine it, tracing the flow of related messages through the network of asynchronous agents. What you are looking for are needless message cycles, or delays that don't necessarily have to happen. For example, in looking at this picture, receipt of a message is being delayed due to the task "post status to DB". When that is understood, the posting could actually be done on a separate thread.
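The merging step can be sketched roughly as follows. The event tuples, node names, and timings are invented for illustration, and the sketch assumes the nodes' clocks are synchronized well enough for timestamps to be comparable. It only surfaces large gaps as candidates; deciding whether a delay is needless is still the manual part.

```python
import heapq

# Hypothetical per-node logs, each already sorted by time:
# (timestamp_ms, node, message_id, event)
NODE_A = [
    (0, "A", "m1", "send request"),
    (450, "A", "m1", "receive reply"),
]
NODE_B = [
    (20, "B", "m1", "receive request"),
    (30, "B", "m1", "post status to DB"),
    (400, "B", "m1", "send reply"),
]


def merge_timeline(*logs):
    """Merge already-sorted per-node logs into one timeline ordered by time."""
    return list(heapq.merge(*logs, key=lambda event: event[0]))


def large_gaps(timeline, min_gap_ms):
    """List (gap_ms, earlier_event, later_event) for consecutive events
    separated by at least min_gap_ms."""
    gaps = []
    for prev, curr in zip(timeline, timeline[1:]):
        gap = curr[0] - prev[0]
        if gap >= min_gap_ms:
            gaps.append((gap, prev[3], curr[3]))
    return gaps


if __name__ == "__main__":
    timeline = merge_timeline(NODE_A, NODE_B)
    for ts, node, msg, event in timeline:
        print(f"{ts:>5} ms  {node}  {msg}  {event}")
    print(large_gaps(timeline, min_gap_ms=100))
```

In this made-up data the only large gap follows "post status to DB", which is the kind of delay you would then examine by hand.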
Upvotes: 3