Reputation: 332
I had a hard time getting the title right, but here is the problem:
Let's imagine I'm managing multiple projects hosted in Docker, each project running in a swarm across x nodes. Different projects range from tens to tens of thousands of requests per second, and demand could increase very quickly for any given project.
Should I create a new swarm and nodes (Azure VMs) for each project and scale them based on usage? This would result in a large number of small-to-large VMs.
Or should I have a much smaller pool of larger VMs, running perhaps only one swarm that handles all the services? I'm fairly sure this would be the more optimal solution, because you lose the per-VM overhead and also eliminate the VMs that just sit there doing nothing because their service is currently not popular.
When you think of the pricing, CPU/RAM scales linearly, so there is no cost difference between one 4-core machine and four 1-core machines (unless you require a large disk).
I have also had problems with VMs that have a minuscule amount of memory (1 GB), because sometimes some random process eats all the memory and the machine is basically dead. Your load-balanced service might not need a lot of memory, but you still need a lot of nodes to ensure reliability (the OS-overhead problem with microservices).
One large swarm with large nodes makes a lot of sense performance/optimization-wise, but I'm worried about reliability. I know Docker containers cannot access other containers' data or the host's data, but what about the swarm? Is it possible that one service floods or crashes the whole node, or even the whole swarm, so that a nightmare ensues because all the company's services are down?
Upvotes: 0
Views: 712
Reputation: 8596
There's no black-and-white answer that works for every org and every app design. If you're looking at cost and management overhead, it does benefit you to have a smaller set of large nodes: you minimize the total number of hosts to manage, and you reduce the OS overhead (assuming the host OS and Docker/Swarm take up the first ~0.5 GB of memory, having fewer, larger instances reduces that waste).
I talk about typical Swarm sizing and design in this DockerCon Swarm talk.
Docker's also got some guidance for EE, which runs Docker Engine and Swarm underneath.
Personally, I would go with a smaller set of larger nodes. It's great what you can do with a single 10-node swarm running 5 managers (smaller instance sizes, only managing the swarm) and 5 workers (8xlarge or higher) on 10 Gbps networking. I find that much more manageable than 50-100 xlarge nodes on only 1 Gbps, for example.
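To keep those 5 managers dedicated to Raft consensus and scheduling (so no app workload can destabilize them), you can drain them so tasks only land on the big workers. A minimal sketch, assuming hypothetical node names `manager1`..`manager5`:

```shell
# Drain each manager so it schedules no service tasks; it still
# participates in the Raft quorum and the scheduler.
for n in manager1 manager2 manager3 manager4 manager5; do
  docker node update --availability drain "$n"
done

# Verify: drained managers show AVAILABILITY=Drain, workers stay Active.
docker node ls
```

This is a common pattern for production swarms: a quorum of small, quiet managers and a separate pool of workers that take all the load.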
You can use resource reservations and limits, along with other features like placement constraints and placement preferences, to ensure services are placed appropriately and to prevent runaway processes from consuming all of a server's resources. You can see a few examples of me doing these things on GitHub and at DockerCon.
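As a minimal sketch of those features together (service name and image are just examples): a limit caps what a runaway task can consume, a reservation tells the scheduler how much must be free on a node before placing a task there, a constraint keeps app tasks off managers, and a placement preference spreads replicas across a node label.

```shell
docker service create \
  --name web \
  --replicas 4 \
  --limit-cpu 1 \
  --limit-memory 512M \      # hard cap: a leak can't take down the whole node
  --reserve-cpu 0.25 \
  --reserve-memory 128M \    # scheduler only places where this much is free
  --constraint 'node.role == worker' \
  --placement-pref 'spread=node.labels.zone' \
  nginx:alpine
```

The same settings can live in a stack file under `deploy.resources` and `deploy.placement` if you prefer `docker stack deploy`.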
Lastly, if near-10 Gbps isn't good enough and you need every ounce of raw networking, consider swapping out the default Swarm network driver, overlay, for others like host, or third-party plugins like Weave.
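For example, a service can attach directly to the host network instead of an overlay, skipping the VXLAN encapsulation entirely (the service name and image here are just placeholders):

```shell
# Overlay is the default for swarm services; --network host trades
# network isolation for raw throughput on each node.
docker service create \
  --name metrics \
  --network host \
  prom/node-exporter
```

Note that with the host driver the task binds ports directly on each node, so you lose the routing mesh and per-service network isolation in exchange for the performance.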
Upvotes: 4