Reputation: 5805
We want to run MinIO in a distributed / high-availability setup, but would like to know a bit more about the behavior of MinIO under different failure scenarios. Especially given the read-after-write consistency, I'm assuming that nodes need to communicate.
So what happens if a node drops out? Will there be a timeout from other nodes, during which writes won't be acknowledged? What happens during network partitions (I'm guessing the partition that has quorum will keep functioning), or with flapping or congested network connections? What if a disk on one of the nodes starts misbehaving and hangs for tens of seconds at a time? Will the network pause and wait for that?
Is there any documentation on how MinIO handles failures?
Upvotes: 2
Views: 3518
Reputation: 157
MinIO has a health check probe.
In a distributed MinIO environment you can put a reverse proxy in front of your MinIO nodes, for example Caddy, which supports health checks for each backend node.
Here is an example of the Caddy proxy configuration I am using.
If you want TLS termination, /etc/caddy/Caddyfile looks like this:
http://yourminio.example.com:80 {
    redir https://{host}{uri}
}
yourminio.example.com:443 {
    tls /etc/ssl/certs/certbundle.pem /etc/ssl/certs/private.key
    proxy / minio01-ip:9000 minio02-ip:9000 minio03-ip:9000 minio04-ip:9000 {
        header_upstream X-Forwarded-Proto {scheme}
        header_upstream X-Forwarded-Host {host}
        header_upstream Host {host}
        health_check /minio/health/ready
        health_check_interval 3s
    }
}
Without TLS:
yourminio.example.com:80 {
    proxy / minio01-ip:9000 minio02-ip:9000 minio03-ip:9000 minio04-ip:9000 {
        header_upstream X-Forwarded-Proto {scheme}
        header_upstream X-Forwarded-Host {host}
        header_upstream Host {host}
        health_check /minio/health/ready
        health_check_interval 3s
    }
}
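To see what the proxy's health check is doing, here is a minimal Python sketch that polls the same /minio/health/ready path on each node and keeps only the ones that answer with HTTP 200. The function names (http_probe, ready_nodes) and the 2 second timeout are my own choices for illustration, not anything MinIO or Caddy defines.

```python
import http.client
import urllib.error
import urllib.request

# Same readiness path that the proxy configuration above polls.
READY_PATH = "/minio/health/ready"


def http_probe(base_url, timeout=2.0):
    """Return True if the node answers the readiness path with HTTP 200.

    The timeout is an illustrative value, not a MinIO default.
    """
    try:
        with urllib.request.urlopen(base_url + READY_PATH, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers refused connections, timeouts, DNS failures and
        # non-2xx responses (urlopen raises HTTPError for those).
        return False


def ready_nodes(nodes, probe=http_probe):
    """Return the nodes whose probe currently succeeds, in the original order."""
    return [node for node in nodes if probe(node)]
```

Calling ready_nodes(["http://minio01-ip:9000", "http://minio02-ip:9000"]) would return only the nodes that currently pass the check. The probe argument is there so the filtering logic can be tried out without a running cluster.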
MinIO nodes can also expose metrics to Prometheus, so you can build a Grafana dashboard and monitor the MinIO cluster nodes (MinIO disks, CPU, memory, network).
For more, please check the docs:
https://docs.min.io/docs/minio-monitoring-guide.html
https://docs.min.io/docs/setup-caddy-proxy-with-minio.html
Upvotes: 1
Reputation: 5805
Since MinIO promises read-after-write consistency, I was wondering about its behavior under various failure modes of the underlying nodes or network.
The MinIO documentation (https://docs.min.io/docs/distributed-minio-quickstart-guide.html) does a good job explaining how to set it up and how to keep data safe, but there's nothing on how the cluster will behave when nodes are down or (especially) when the network connection is flapping or slow, when disks cause I/O timeouts, etc.
This issue (https://github.com/minio/minio/issues/3536) pointed out that MinIO uses https://github.com/minio/dsync internally for distributed locks.
From the documentation:
A node will succeed in getting the lock if n/2 + 1 nodes (whether or not including itself) respond positively.
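I think this is a pretty nice model: as long as a node can still reach a majority (n/2 + 1) of the nodes, it can keep acquiring locks, while a node stuck on the minority side of a partition cannot.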
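As a rough illustration of the quorum rule quoted above, this small Python sketch computes how many positive responses are needed and whether a node can still get a lock. It assumes that "n/2 + 1" means integer division and that every reachable node answers positively. Both are my assumptions, and the function names are made up for this example.

```python
def lock_quorum(n):
    """Positive responses needed for a lock: n/2 + 1 (integer division assumed)."""
    return n // 2 + 1


def can_acquire_lock(n, reachable):
    """True if a node reaching `reachable` nodes (itself included) has a majority.

    Assumes every reachable node grants the lock.
    """
    return reachable >= lock_quorum(n)


# A 4-node cluster split 2/2: neither half reaches the 3 responses required.
print(lock_quorum(4), can_acquire_lock(4, 2), can_acquire_lock(4, 3))
```

Under these assumptions, an even split of a 4-node cluster leaves both halves without a lock quorum, which is one reason to test partition behavior yourself before relying on it.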
Please note that if clients connect to a single MinIO node directly, MinIO does not by itself provide any protection against that node being down. We still need some sort of HTTP load-balancing front end for an HA setup (which might be nice for asterisk / authentication anyway).
Note 2: This is a bit of guesswork based on the documentation of MinIO and dsync, and on notes in issues and Slack. I haven't actually tested these failure scenarios, which is something you should definitely do if you want to run this in production.
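To make the load-balancing point concrete, here is a minimal Python sketch of what such a front end does at its core: rotate over the backends and skip any that are currently unhealthy. The RoundRobin class and the is_healthy callback are hypothetical names for illustration. A real setup would use a proper proxy rather than this.

```python
class RoundRobin:
    """Pick backends in rotation, skipping ones reported as unhealthy."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._next = 0  # index where the next search starts

    def pick(self, is_healthy):
        """Return the next healthy backend, or None if none is healthy."""
        count = len(self._backends)
        for offset in range(count):
            idx = (self._next + offset) % count
            candidate = self._backends[idx]
            if is_healthy(candidate):
                # Start after this backend next time so load is spread out.
                self._next = (idx + 1) % count
                return candidate
        return None
```

With three backends and the second one down, successive pick calls alternate between the first and the third, so clients never see the failed node.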
Upvotes: 3