Reputation: 2832
So I have been trying to do some load testing with Locust, and I got very different results in two scenarios: one using a wait_time
between 1 and 2 seconds and the other between 0.01 and 0.1 seconds. The actual SLA for the service is under 100ms.
from locust import HttpUser, task, between

class PerformanceTest(HttpUser):
    wait_time = between(0.01, 0.1)  # the other case uses between(1, 2)

    @task(1)
    def load_test(self):
        ...  # POST request
Parameters for the run: 100 users in total, ramping up at 10 users/sec, and testing for 30 seconds.
locust -f performance_test.py --headless -u 100 -r 10 --host http://127.0.0.1:8080 -t 30s
Case 1: wait_time between 0.01 and 0.1 ---> 99th percentile is 5100ms (scroll to the right)
Response time percentiles (approximated)
Type     Name                                  50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|------------------------------------|------|------|------|------|------|------|------|------|------|------|------|------
POST     /rest/confidence                      1300   1500   1600   1800   2000   2300   4700   5100   5500   5500   5500   1405
--------|------------------------------------|------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                            1300   1500   1600   1800   2000   2300   4700   5100   5500   5500   5500   1405
Case 2: wait_time between 1 and 2 ---> 99th percentile is 1100ms (scroll to the right)
Response time percentiles (approximated)
Type     Name                                  50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|------------------------------------|------|------|------|------|------|------|------|------|------|------|------|------
POST     /rest/logincontextclassifications      460    600    670    710    870    970   1100   1100   1300   1300   1300   1251
--------|------------------------------------|------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                             460    600    670    710    870    970   1100   1100   1300   1300   1300   1251
Does anyone know why this is happening? Why does the wait_time
create this much of a difference (5100ms vs 1100ms)?
Upvotes: 0
Views: 3016
Reputation: 2866
wait_time
is the amount of time, in seconds, that a user waits between finishing its tasks and starting them again. The shorter the wait_time
is, the closer you get to constant requests: a consistent barrage of users immediately starting up again as soon as they finish. In your case, that would be 100 users making their requests nearly simultaneously.
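Conceptually, each simulated user does something like this (a simplified sketch of Locust's task loop, not the actual implementation):

import random
import time

def user_loop(run_task, wait_min, wait_max):
    # Run the task, then sleep for wait_time = between(wait_min, wait_max),
    # then repeat. Locust runs one of these loops per simulated user.
    while True:
        run_task()                                       # e.g. your POST request
        time.sleep(random.uniform(wait_min, wait_max))   # the wait_time

With between(0.01, 0.1) the sleep is negligible, so all 100 users are effectively hammering the server back-to-back; with between(1, 2) each user gives the server a 1-2 second break after every request.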
Many systems have a cost associated with opening connections for new users. For example, even just the overhead of a TLS handshake and connection setup takes some amount of time that subsequent requests over the same connection won't have. A higher wait_time
spreads the work out, giving the system time to "breathe" and finish serving other responses before being forced to take on new requests.
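You can see that overhead even outside of Locust. Here's a minimal sketch using the requests library (the URL is just an arbitrary public HTTPS endpoint for illustration; point it at your own service): the first request on a Session pays for DNS lookup, the TCP connect and the TLS handshake, while later requests reuse the open connection.

import time
import requests

URL = "https://example.com/"  # placeholder endpoint

with requests.Session() as session:
    for i in range(3):
        start = time.perf_counter()
        session.get(URL)
        print(f"request {i}: {(time.perf_counter() - start) * 1000:.1f} ms")

Typically the first request comes back noticeably slower than the following ones, because it includes the connection setup that the others get to skip.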
wait_time = between(1, 2)
means that when a user finishes the defined tasks, Locust will wait a random time between 1 and 2 seconds before that user starts its tasks again. 2 seconds is a relatively long break for a server. With response times at 1100ms, that means the server could be sitting idle for up to 900ms after serving one user before serving another, which is nearly enough time to catch up on a full user's request elsewhere before getting a new one. wait_time = between(0.01, 0.1)
with an 1100ms response time, by contrast, means the server has at most 100ms between user requests, nowhere near the time it takes to serve a full request, so it will not be able to catch up as much. In fact, once your response time exceeds your time between requests, the system can fall further and further behind on requests, leading to higher and higher response times the longer the test goes on. The server may eventually become entirely unresponsive.
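You can see that fall-behind effect with a toy single-server queue (pure Python, made-up numbers that only roughly echo your two cases):

def response_times(arrival_gap, service_time, n_requests=10):
    # Requests arrive every `arrival_gap` seconds and each takes
    # `service_time` seconds to process, one at a time. Once
    # service_time > arrival_gap, every new request waits longer
    # than the previous one.
    free_at = 0.0   # when the server finishes its current work
    times = []
    for i in range(n_requests):
        arrival = i * arrival_gap
        start = max(arrival, free_at)
        free_at = start + service_time
        times.append(round(free_at - arrival, 1))  # observed response time
    return times

print(response_times(arrival_gap=1.5, service_time=1.1))  # stays at ~1.1s
print(response_times(arrival_gap=0.1, service_time=1.1))  # 1.1, 2.1, 3.1, ... keeps climbing

Real servers handle requests concurrently, so the numbers aren't meant to match your results, but the shape is the same: once work arrives faster than it can be finished, response times grow for as long as the load keeps up.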
This is expected of any system; that's just how computers work. Each server has some maximum number of users it can respond to within an acceptable amount of time, based on the work it's given to do for each user. To increase the system's performance you'd need to either add more servers to divide up the work, or reduce the work each server has to do (e.g. more efficient code to perform the same tasks).
Finding your system's acceptable threshold of users per second (not necessarily the same as requests per second, but that's a digression) is what Locust is made to help you do. By running multiple tests with different wait_time
values, you've helped show that your system currently would not meet its response time SLA if it's expected to serve 100 users simultaneously at any given time.
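If you want to repeat that kind of comparison without editing the locustfile each time, one option (a sketch; the environment variable names are made up) is to make the wait_time configurable:

import os
from locust import HttpUser, task, between

class PerformanceTest(HttpUser):
    # Defaults to between(1, 2); override per run with WAIT_MIN / WAIT_MAX
    wait_time = between(
        float(os.getenv("WAIT_MIN", "1")),
        float(os.getenv("WAIT_MAX", "2")),
    )

    @task
    def load_test(self):
        ...  # your POST request

Then, for example: WAIT_MIN=0.01 WAIT_MAX=0.1 locust -f performance_test.py --headless -u 100 -r 10 -t 30s --host http://127.0.0.1:8080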
(One thing to note, though, is that your two tests hit different endpoints, so the comparison may not be valid: the endpoints may not be doing the same work. You'd want to run both tests against the same endpoint to get a valid comparison.)
Upvotes: 4