s510

Reputation: 2832

Load testing using Locust

I have been doing some load testing with Locust and got very different results in two scenarios: one using a wait_time between 1 and 2 seconds and the other between 0.01 and 0.1 seconds. The actual SLA for the service is under 100ms.

from locust import HttpUser, task, between


class PerformanceTest(HttpUser):
    wait_time = between(0.01, 0.1)  # case 1; case 2 uses between(1, 2)

    @task(1)
    def load_test(self):
        ...  # POST request

Parameters for the run: 100 users total, ramping up at 10 users/sec, testing for 30 seconds.

locust -f performance_test.py --headless -u 100 -r 10 --host http://127.0.0.1:8080 -t 30s

Case 1: wait_time between 0.01 and 0.1 ---> 99th percentile is 5100ms

Response time percentiles (approximated)
Type     Name                                    50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|------------------------------------|------|------|------|------|------|------|------|------|------|------|------|------
POST     /rest/confidence                       1300   1500   1600   1800   2000   2300   4700   5100   5500   5500   5500   1405
--------|------------------------------------|------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                             1300   1500   1600   1800   2000   2300   4700   5100   5500   5500   5500   1405

Case 2: wait_time between 1 and 2 ---> 99th percentile is 1100ms

Response time percentiles (approximated)
Type     Name                                    50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|------------------------------------|------|------|------|------|------|------|------|------|------|------|------|------
POST     /rest/logincontextclassifications       460    600    670    710    870    970   1100   1100   1300   1300   1300   1251
--------|------------------------------------|------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                              460    600    670    710    870    970   1100   1100   1300   1300   1300   1251

Does anyone know why this is happening? Why does the wait_time make this much of a difference (5100ms vs 1100ms)?

Upvotes: 0

Views: 3016

Answers (1)

Solowalker

Reputation: 2866

wait_time is the amount of time, in seconds, that a simulated user waits between finishing its tasks and starting them again. The shorter the wait_time, the closer you get to constant requests: a consistent barrage of users immediately starting their work again the moment they finish. In your case, that means 100 users making their requests essentially back to back.
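As a minimal sketch of what between() amounts to (not from your code; the endpoint name here is a placeholder), wait_time is just a callable that returns the number of seconds a user sleeps between task runs, so the two configurations differ only in how long each user pauses before its next request:

import random

from locust import HttpUser, task


class ExampleUser(HttpUser):
    # Roughly equivalent to wait_time = between(0.01, 0.1): sleep a random
    # 10-100 ms after each task run before running the tasks again.
    def wait_time(self):
        return random.uniform(0.01, 0.1)

    @task
    def post_something(self):
        # placeholder endpoint, for illustration only
        self.client.post("/rest/example", json={})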

Many systems have a cost associated with starting connections with new users. For example, even just the overhead of a simple TLS handshake and connection setup takes some amount of time that subsequent requests over the same connection won't have. A higher wait_time spreads the work out, giving a system time to "breathe" and finish serving other responses before being forced to take on new requests.
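You can get a rough feel for that per-connection overhead outside of Locust with a hedged sketch like the one below; the URL is a placeholder and the absolute timings depend entirely on your environment:

import time

import requests

URL = "https://example.com/"  # placeholder target

start = time.perf_counter()
for _ in range(5):
    requests.get(URL)  # a new connection (and TLS handshake) for every request
print("fresh connection each time:", time.perf_counter() - start)

session = requests.Session()
start = time.perf_counter()
for _ in range(5):
    session.get(URL)  # the connection is reused after the first request
print("reused connection:         ", time.perf_counter() - start)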

wait_time = between(1, 2) means that when a user finishes its defined tasks, Locust waits a random time between 1 and 2 seconds before that user runs them again. 2 seconds is a relatively long time for a server. With response times around 1100ms, the server could sit idle for up to 900ms after serving one user before serving it again, which is nearly enough time to finish another user's full request in the meantime. wait_time = between(0.01, 0.1) with an 1100ms response time, on the other hand, means at most 100ms between one user's requests, nowhere near the time it takes to serve a full request, so the server can't catch up as much. In fact, once your response time exceeds your time between requests, the system can fall further and further behind, which leads to higher and higher response times the longer the test runs. The server may eventually become entirely unresponsive.
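Here is a back-of-the-envelope sketch of that arithmetic. It assumes a simple closed-loop model (each user waits for its response, then waits wait_time, then repeats) and borrows the roughly 1100ms response time from case 2 purely as an illustration:

users = 100
response_time = 1.1  # seconds, roughly the 99th percentile from case 2

for wait_min, wait_max in [(1, 2), (0.01, 0.1)]:
    # Best case: how long the server can sit idle between one user's requests.
    # Negative means the next request is due before the previous one is done.
    max_idle = wait_max - response_time
    # Average length of one user's loop (response + wait) and the rough
    # aggregate request rate that 100 such users offer the server.
    cycle = response_time + (wait_min + wait_max) / 2
    offered_rps = users / cycle
    print(f"between({wait_min}, {wait_max}): max idle {max_idle:+.2f}s per user, "
          f"~{offered_rps:.0f} requests/s offered")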

This is expected of any system; that's just how computers work. Each server has some maximum number of users it can respond to within an acceptable amount of time, based on the work it's given to do for each user. To increase the system's performance you'd need to either add more servers to divide up the work or reduce the work each server has to do (e.g. more efficient code to perform the same tasks).

Finding your system's acceptable threshold of users per second (not necessarily the same as requests per second, but that's a digression) is what Locust is made to help you do. By running multiple tests with different wait_times, you've helped show that your system currently would not meet its response time SLA if it's expected to serve 100 users simultaneously at any given time.
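One way to search for that threshold in a single run is a stepped load. This is only a hedged sketch using Locust's LoadTestShape API; the step size, step duration, and ceiling are arbitrary examples, and your PerformanceTest user class would live in the same locustfile:

from locust import LoadTestShape


class StepLoadShape(LoadTestShape):
    """Add 10 users every 60 seconds up to 100, then stop.

    Watch the percentile stats for each step and note the user count at
    which the 99th percentile first exceeds the 100 ms SLA.
    """

    step_users = 10
    step_duration = 60  # seconds per step
    max_users = 100

    def tick(self):
        run_time = self.get_run_time()
        if run_time > (self.max_users // self.step_users) * self.step_duration:
            return None  # all steps done, stop the test
        users = min(self.max_users,
                    (int(run_time) // self.step_duration + 1) * self.step_users)
        return (users, users)  # (target user count, spawn rate per second)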

(One thing to note, though, is that your different tests are against different endpoints so the comparison may not be valid as the different endpoints may not be doing the same work. You'd want to run your multiple tests against the same endpoint to get a valid comparison.)

Upvotes: 4
