Reputation: 17
We are facing issues with Varnish hitting its maximum thread count, along with spikes in backend and session connections. We are not sure of the cause, but we have observed that it happens when the origin servers have high response times and eventually return uncacheable (502) responses.
Varnish usage:
We've configured Varnish behind an nginx proxy, so incoming requests first hit nginx and are then consistently balanced across n Varnish instances. On a cache miss, Varnish calls the origin nginx host, here example.com.
In our case, we only cache HTTP GET requests, and all of them return a JSON payload ranging from 0.001 MB to 2 MB in size.
Example request:
HTTP GET: http://test.com/test/abc?arg1=val1&arg2=val2
Expected xkey: test/abc
Response: JSON payload
Approx QPS: 60-80 HTTP GET requests
Avg obj TTL: 2d
Avg obj grace: 1d
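For reference, we invalidate via the PURGE method with an xkey header, as handled in the VCL below. A hypothetical invalidation call (the tag test/abc is just the example above) would be:

```shell
# Soft-purge every object tagged with xkey "test/abc".
# Sent from localhost, the only member of the purgers ACL,
# against the Varnish listen port (-a :6081).
curl -X PURGE -H "xkey: test/abc" http://localhost:6081/
```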
Attaching the VCL file, statistics and the varnishd run command for debugging purposes.
Monitoring stats:
Varnish and VCL configuration:
Varnish version: Linux,5.4.0,x86_64,varnish-6.5.1
varnishd -F -j unix,user=nobody -a :6081 -T localhost:6082 -f /etc/varnish/default.vcl -s file,/opt/varnishdata/cache,750G
vcl 4.0;
import xkey;
import std;
acl purgers {
"localhost";
}
backend default {
.host = "example.com";
.port = "80";
}
sub vcl_recv {
unset req.http.Cookie;
if (req.method == "PURGE") {
if (client.ip !~ purgers) {
return (synth(403, "Forbidden"));
}
if (req.http.xkey) {
set req.http.n-gone = xkey.softpurge(req.http.xkey);
return (synth(200, "Invalidated "+req.http.n-gone+" objects"));
}
else {
return (purge);
}
}
# remove request id from request
set req.url = regsuball(req.url, "reqid=[-_A-Za-z0-9+()%.]+&?", "");
# remove trailing ? or &
set req.url = regsub(req.url, "[?&]+$", "");
# set hostname for backend request
set req.http.host = "example.com";
}
sub vcl_backend_response {
# Sets a default TTL in case the backend does not send a caching-related header
set beresp.ttl = std.duration(beresp.http.X-Cache-ttl, 2d);
# Grace period to keep serving stale entries
set beresp.grace = std.duration(beresp.http.X-Cache-grace, 1d);
# extract xkey
if (bereq.url ~ "/some-string/") {
set beresp.http.xkey = regsub(bereq.url, ".*/some-string/([^?]+).*", "\1");
}
# This block will make sure that if the upstream return a 5xx, but we have the response in the cache (even if it's expired),
# we fall back to the cached value (until the grace period is over).
if (beresp.status != 200 && beresp.status != 422) {
# This check is important. If is_bgfetch is true, it means that we've found and returned the cached object to the client,
# and triggered an asynchronous background update. In that case, if it was a 5xx, we have to abandon, otherwise the previously cached object
# would be erased from the cache (even if we set uncacheable to true).
if (bereq.is_bgfetch) {
return (abandon);
}
# We should never cache a 5xx response.
set beresp.uncacheable = true;
}
}
sub vcl_deliver {
unset resp.http.X-Varnish;
unset resp.http.Via;
set resp.http.X-Cached = req.http.X-Cached;
}
sub vcl_hit {
if (obj.ttl >= 0s) {
set req.http.X-Cached = "HIT";
return (deliver);
}
if (obj.ttl + obj.grace > 0s) {
set req.http.X-Cached = "STALE";
return (deliver);
}
set req.http.X-Cached = "MISS";
}
sub vcl_miss {
set req.http.X-Cached = "MISS";
}
Please let us know if you have any suggestions to improve the current configuration, or if anything else is required to debug this.
Thanks
Abhishek Surve
Upvotes: 0
Views: 2288
Reputation: 4808
If you run out of threads, from a firefighting point of view it makes sense to increase the number of threads per thread pool.
Here's a varnishstat command that displays real-time thread consumption and potential thread limits:
varnishstat -f MAIN.threads -f MAIN.threads_limited
Press the d key to display fields with a zero value.
If the MAIN.threads_limited counter increases, you have exceeded the maximum number of threads per pool, which is set by the thread_pool_max runtime parameter.
It makes sense to display the current thread_pool_max value by executing the following command:
varnishadm param.show thread_pool_max
You can use varnishadm param.set to set a new thread_pool_max value, but it is not persisted and won't survive a restart.
The best way is to set it through a -p parameter in your systemd service file.
I noticed you're using the file stevedore to store large volumes of data. We strongly advise against it, because it is very sensitive to disk fragmentation: it can slow Varnish down when it has to perform too many disk seeks, and it relies too much on the kernel's page cache to be efficient.
On open source Varnish, -s malloc is still your best bet. You can increase your cache capacity through horizontal scaling and by running 2 tiers of Varnish.
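As a sketch (the 32G size is a placeholder and must fit comfortably in RAM), the run command from your question would change like this:

```shell
# Same flags as before, but with the file stevedore swapped for malloc:
varnishd -F -j unix,user=nobody -a :6081 -T localhost:6082 \
    -f /etc/varnish/default.vcl -s malloc,32G
```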
The most reliable way to use disk for large volumes of data is Varnish Enterprise's Massive Storage Engine. It's not free and open source, but it was built specifically to counter the poor performance of the file stevedore.
Based on how you're describing the problem, it looks like Varnish has to spend too much time dealing with uncached responses. This requires a backend connection.
Luckily Varnish lets go of the backend thread and allows client threads to deal with other tasks while Varnish is waiting for the backend to respond.
But if we can limit the number of backend fetches, maybe we can improve the overall performance of Varnish.
I'm not too concerned about cache misses, because a cache miss is just a hit that hasn't happened yet. However, we can look at the requests that cause the most cache misses by running the following command:
varnishtop -g request -i requrl -q "VCL_Call eq 'MISS'"
This will list the URLs of the top misses. You can then drill down on individual requests and figure out why they cause cache misses so often.
You can use the following command to inspect the logs for a specific URL:
varnishlog -g request -q "ReqUrl eq '/my-page'"
Please replace /my-page with the URL of the endpoint you're inspecting.
For cache misses, we care about the TTL. Maybe the TTL was set too low. The TTL tag will show you which TTL value is used.
Also keep an eye on the Timestamp tags, because they can highlight potential slowdowns.
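To focus on just those two tags for a given URL (with /my-page again as a placeholder), you can combine varnishlog's tag filter with the query:

```shell
# Only show TTL decisions and timing information for one endpoint:
varnishlog -g request -i TTL -i Timestamp -q "ReqUrl eq '/my-page'"
```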
Uncacheable content is more dangerous than uncached content. A cache miss will eventually result in a hit, whereas a cache bypass will always be uncacheable and will always require a backend fetch.
The following command will list your top cache bypasses by URL:
varnishtop -g request -i requrl -q "VCL_Call eq 'PASS'"
Then again, you can drill down using the following command:
varnishlog -g request -q "ReqUrl eq '/my-page'"
It's important to understand why Varnish would bypass the cache for certain requests. The built-in VCL describes this process. See https://www.varnish-software.com/developers/tutorials/varnish-builtin-vcl/ for more information about the built-in VCL.
Typical things you should look for:
- A request method other than GET or HEAD
- An Authorization header
- A Cookie header
- A Set-Cookie response header
- An s-maxage=0 or max-age=0 directive in the Cache-Control header
- A private, no-cache or no-store directive in the Cache-Control header
- A Vary: * header
You can also run the following command to figure out how many passes take place on your system:
varnishstat -f MAIN.s_pass
If that is too high, you might want to write some VCL that handles Authorization, Cookie and Set-Cookie headers.
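As a minimal sketch (the /test/ prefix comes from your example request, and the assumption that these endpoints never vary on cookies or authorization is yours to verify):

```vcl
sub vcl_recv {
    # Assumption: these JSON GET endpoints never depend on cookies or
    # authorization, so stripping the headers lets the built-in VCL
    # serve them from cache instead of passing.
    if (req.method == "GET" && req.url ~ "^/test/") {
        unset req.http.Cookie;
        unset req.http.Authorization;
    }
}

sub vcl_backend_response {
    # Assumption: clients don't need Set-Cookie on these endpoints.
    if (bereq.url ~ "^/test/") {
        unset beresp.http.Set-Cookie;
    }
}
```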
The conclusion can also be that you need to optimize your Cache-Control headers.
If you've done all the optimization you can and you still get a lot of cache bypasses, you need to scale out your platform a bit more.
One line of VCL that caught my eye is the following:
set beresp.ttl = std.duration(beresp.http.X-Cache-ttl, 2d);
You are using an X-Cache-ttl response header to set the TTL. Why would you do that when there is a conventional Cache-Control header for exactly that purpose?
An extra risk is the fact that the built-in VCL cannot handle this and cannot properly mark these requests as uncacheable.
The most dangerous thing that can happen is that you set beresp.ttl = 0 through this header and then hit a scenario where set beresp.uncacheable = true is reached in your VCL.
If beresp.ttl remains zero at that point, Varnish will not be able to store Hit-For-Miss objects in the cache for these situations. This means that subsequent requests for this resource will be added to the waiting list. But because we're dealing with uncacheable content, these requests will never be satisfied by Varnish's request coalescing mechanism.
The result is that the waiting list will be processed serially and this will increase the waiting time, which might result in exceeding the available threads.
My advice is to add set beresp.ttl = 120s right before set beresp.uncacheable = true;. This will ensure Hit-For-Miss objects are created for uncacheable content.
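Applied to the vcl_backend_response logic from your VCL, that looks like this:

```vcl
sub vcl_backend_response {
    if (beresp.status != 200 && beresp.status != 422) {
        if (bereq.is_bgfetch) {
            return (abandon);
        }
        # Keep a Hit-For-Miss marker for 2 minutes so that concurrent
        # requests for this resource skip request coalescing instead of
        # piling up serially on the waiting list.
        set beresp.ttl = 120s;
        set beresp.uncacheable = true;
    }
}
```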
To build on the conventional-header argument: please remove the following lines of code from your VCL:
# Sets a default TTL in case the backend does not send a caching-related header
set beresp.ttl = std.duration(beresp.http.X-Cache-ttl, 2d);
# Grace period to keep serving stale entries
set beresp.grace = std.duration(beresp.http.X-Cache-grace, 1d);
Replace this logic with proper use of Cache-Control headers.
Here's an example of a Cache-Control header that results in a 3600-second TTL and 1 day of grace:
Cache-Control: public, s-maxage=3600, stale-while-revalidate=86400
This feedback is not related to your problem, but is just a general best practice.
At this point it's not entirely clear what the root cause of your problem is; you mention both threads and slow backends.
On the one hand I have given you ways to inspect the thread pool usage and a way to increase the threads per pool.
On the other hand, we need to look at potential cache misses and cache bypass scenarios that might disrupt the balance on the system.
If certain headers cause unwanted cache bypasses, we might be able to improve the situation by writing the proper VCL.
And finally, we need to ensure you are not adding uncacheable requests to the waiting list.
Upvotes: 2