jSebestyén

Reputation: 1806

What causes a server to close a TCP/IP connection abruptly with a Reset (RST Flag)?

TL;DR

For quite some time now we have been facing a weird issue on all of our systems (including Prod!). On a regular basis the TCP connection to the server is closed abruptly by the server (or, to be exact, somewhere on the way from the server to the client). This leads to failing requests and is most prominent in file uploads, which always fail for bigger files (where bigger means just >100 KB). Additionally, the same requests fail much less frequently (but still fail sometimes!) if routed through an nginx reverse proxy.

Setup

We (let's call us MyCompany) are developing software (a Java/Spring Boot service) for CustomerCompany. The software is shipped as a Docker container and hosted either locally, in a private cloud provided by CloudCompany, or in two different Azure Kubernetes clusters. The software communicates with an SAP system hosted by SAPHostingCompany. There are actually multiple SAP systems for different stages.

The software communicates (depending on stage/environment) either directly with the SAP-system or through an nginx reverse proxy (hosted on a machine of MyCompany). The reasoning behind the nginx reverse proxy is that each IP communicating with the SAP-system has to be whitelisted by SAPHostingCompany. Especially for local development this would have been quite cumbersome to maintain.

The problem

Starting a few weeks back we noticed that requests sometimes fail (seemingly) at random. This happens on all stages. Supposedly no changes whatsoever were made that might have caused this...

While this is quite an annoyance for most requests (which can just be retried if they fail), it completely prevents larger files from being uploaded. Larger means just >100 KB in this context.
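To illustrate what "just retry" means here, this is a minimal sketch in Python (our service is Java/Spring Boot, so this is not our actual code). The names `call_with_retry`, `max_attempts`, and `base_delay` are made up for this post. It assumes the reset surfaces as `ConnectionResetError`; HTTP client libraries may wrap it in their own exception type.

```python
import time


def call_with_retry(request, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call request() and retry when the peer resets the connection.

    request is any zero-argument callable that performs one HTTP call.
    Only ConnectionResetError is retried (assumption: this is how the RST
    surfaces); every other exception propagates immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except ConnectionResetError:
            if attempt == max_attempts:
                raise  # out of attempts, let the caller see the reset
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

This kind of retry only helps requests that sometimes get through. It does nothing for the larger uploads, which fail on every attempt.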

We tried to investigate the problem and noticed in tcpdump that upon failure the server sends a TCP RST packet, thus aborting the connection (admittedly, we cannot be 100% sure whether it's the server itself sending the RST or some intermediate component). The RST arrives at different points within the TCP connection, so there is no single packet (or combination of packets) that immediately causes the server to close the connection.
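One check that might help with the "server or intermediate component" question is comparing the IP TTL of the RST with the TTL of the other packets coming from the server address. The sketch below is only a heuristic under that assumption, not something we have confirmed for our setup. The `Packet` record, the `suspicious_resets` function, and the `tolerance` parameter are hypothetical names, and the fields would have to be filled from a capture by hand or with a parser of your choice.

```python
from collections import Counter
from typing import NamedTuple


class Packet(NamedTuple):
    src: str   # source IP address as seen in the capture
    ttl: int   # IP TTL field of the received packet
    rst: bool  # True if the TCP RST flag is set


def suspicious_resets(packets, server_ip, tolerance=1):
    """Return RST packets from server_ip whose TTL deviates from the norm.

    The norm is the most common TTL among the server's non-RST packets.
    A deviation larger than `tolerance` hints that another hop generated
    the RST; a matching TTL proves nothing either way.
    """
    normal = Counter(p.ttl for p in packets if p.src == server_ip and not p.rst)
    if not normal:
        return []  # no baseline to compare against
    baseline = normal.most_common(1)[0][0]
    return [
        p for p in packets
        if p.src == server_ip and p.rst and abs(p.ttl - baseline) > tolerance
    ]
```

An empty result does not clear the network path: a device close to the server, or one that copies the TTL, would go unnoticed.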

Most interestingly, this failure happens far less often (but still does!) in the setup with the intermediate nginx reverse proxy.
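To put a number on "far less often", something like the following sketch could be run against both paths. It is not what we used; `tls_probe`, `reset_rate`, and their parameters are made-up names, and the certificate paths would be whatever client certificate the server expects. `tls_probe` only fits the direct path, because with nginx in between the TLS handshake is done by nginx and the probe would have to be a plain HTTP request instead.

```python
import socket
import ssl


def tls_probe(host, port, certfile=None, keyfile=None, timeout=10.0):
    """Return True if a TLS handshake with host:port completes.

    Returns False when the connection is reset; any other error propagates
    so that it is not mistaken for a reset.
    """
    context = ssl.create_default_context()
    if certfile:
        # Client certificate, since the server asks for one in the handshake.
        context.load_cert_chain(certfile, keyfile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            # wrap_socket performs the handshake before returning.
            with context.wrap_socket(raw, server_hostname=host):
                return True
    except ConnectionResetError:
        return False


def reset_rate(probe, attempts):
    """Run probe() `attempts` times and return the fraction that were reset."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    resets = sum(1 for _ in range(attempts) if not probe())
    return resets / attempts
```

`reset_rate` takes any zero-argument callable, so the same counting works for a handshake probe on the direct path and an HTTP probe on the nginx path.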

Nginx Reverse proxy

The nginx config looks like this:

events {
worker_connections 1024;
}

http {
log_format combined_with_requesttime '$remote_addr $host $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" $request_time $upstream_response_time $pipe';
log_format combined_with_token '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" "$http_de_comdirect_cif_globalRequestId"';
log_format combined_with_token_host '$remote_addr $host $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" "$http_de_comdirect_cif_globalRequestId"';
log_format xcombined '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" "$ssl_client_s_dn"';

sendfile    on;
server_tokens on;
types_hash_max_size 1024;
types_hash_bucket_size 512;
server_names_hash_bucket_size 64;
server_names_hash_max_size 512;
keepalive_timeout  65;
tcp_nodelay        on;

client_max_body_size    10m;
client_body_buffer_size 128k;
proxy_redirect          off;
proxy_connect_timeout   90;
proxy_send_timeout      90;
proxy_read_timeout      90;
proxy_buffers           32 4k;
proxy_buffer_size       8k;
proxy_set_header        Host $host;
proxy_set_header        X-Real-IP $remote_addr;
proxy_set_header        X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_headers_hash_bucket_size 64;

server {
    listen                0.0.0.0:8080 default_server;
    server_name           _;
    resolver              127.0.0.11 valid=30s;

    access_log            /dev/stdout combined_with_token_host;
    error_log             /dev/stdout debug;

    underscores_in_headers on; # for passing the headers on to SAP
    large_client_header_buffers 4 16k;
    proxy_buffer_size           16k;
    proxy_buffers               4 16k;
    real_ip_header              <blurred>;
    set_real_ip_from            0.0.0.0/0;

    location /sap1/ {
        rewrite ^ $request_uri;
        rewrite ^/sap1/(.*) $1 break;
        return 400; #if the second rewrite won't match
        proxy_pass            https://SAPHostingCompany.sap1:8043/$uri;
        proxy_read_timeout    130;
        proxy_connect_timeout 90;
        proxy_redirect        off;
        proxy_buffering       off;
        client_max_body_size  30m;
    }

    location /sap2/ {
        rewrite ^ $request_uri;
        rewrite ^/sap2/(.*) $1 break;
        return 400; #if the second rewrite won't match
        proxy_pass            https://SAPHostingCompany.sap2:8043/$uri;
        proxy_read_timeout    130;
        proxy_connect_timeout 90;
        proxy_redirect        off;
        proxy_buffering       off;
        client_max_body_size  50m;
    }

    location /sap3/ {
        rewrite ^ $request_uri;
        rewrite ^/sap3/(.*) $1 break;
        return 400; #if the second rewrite won't match
        proxy_pass            https://SAPHostingCompany.sap3:8043/$uri;
        proxy_read_timeout    130;
        proxy_connect_timeout 90;
        proxy_redirect        off;
        proxy_buffering       off;
        client_max_body_size  50m;
    }
}
}

The server accepts only TLS-secured connections. One difference between the two setups is where the TLS connection is established:

software <-TLS-secured-> SAP vs software <-unsecured-> nginx <-TLS-secured-> SAP

Here is an example of a successful request: [image: successful tcpdump]

And here is the same request aborted with an RST flag: [image: tcpdump of the aborted request]

Here the connection is aborted immediately after the client sends Certificate, Client Key Exchange, Change Cipher Spec, and Encrypted Handshake Message, but it might fail at any point. For example, in most file upload errors ~10-20 data packets are sent successfully before the connection is aborted.

Conclusion

We are at a complete loss as to what else to investigate or how to narrow this down. Unfortunately SAPHostingCompany is not very forthcoming in this bug hunt :( We, of course, think it must be some kind of infrastructure problem on their side, since the error appeared on all stages/environments simultaneously, while they blame us, since the nginx solution seems to work...

So if anybody has a clue as to what might be going on here I would be very grateful.

Related question

During this quest I came across this question. That user was facing regular RSTs after a constant amount of time (which is not what we are experiencing). Some of the proposed solutions do sound promising, but SAPHostingCompany assures us that none of them apply (again, communication between MyCompany and SAPHostingCompany is quite difficult)... Unfortunately we lack the know-how to determine which of them might actually explain and fix our problem.

Upvotes: 0

Views: 2173
