I’m working on a Docker Swarm deployment to production, and these are some of my thoughts on how things should be. In my case the deployment is on AWS, but I’m deliberately skipping the AWS-specific parts in this blog post.

Know the Basics

A swarm will consist of managers, worker nodes and services.
A worker node is basically a worker of your swarm: it only receives instructions.
A manager is a node that can give instructions to your swarm (e.g. creating and removing services). Managers can also run instances of services.
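For reference, this is roughly how a swarm gets assembled (the IP here is a placeholder, and the worker token is printed by the init command):

# On the first manager (use an address the other nodes can reach)
docker swarm init --advertise-addr 10.0.9.10

# On each worker, run the join command that init printed
docker swarm join --token <worker-token> 10.0.9.10:2377

# Back on a manager: list every node and its role
docker node ls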

A service can have multiple replicas and Docker will distribute them evenly across your nodes - if you have 3 (active) nodes and a service with 3 replicas, each node will have a running instance of that service.
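As a quick sketch (the service name and image are arbitrary):

# Three replicas, spread across the active nodes
docker service create --name web --replicas 3 nginx

# See which node each replica landed on
docker service ps web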

For any service in the swarm, Docker will round-robin between the replicas (shameless plug, watch Load balancing is impossible). Docker calls this feature of distributing incoming connections from outside the swarm ingress load balancing. It’s important to note that if you access a service from within the swarm, it will also go through the built-in load balancer.

If you try to access a service on a node that’s not running it, Docker will re-route the request to a node that has a running instance of that service.
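Building on the sketch above: once a service publishes a port, every node in the swarm listens on that port, and any node will answer even if it runs no replica (the node IP below is a placeholder):

# Publish port 8080 on the routing mesh, mapped to port 80 in the containers
docker service update --publish-add 8080:80 web

# Any node's IP works here, whether or not it runs a replica of web
curl http://<any-node-ip>:8080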

Here the component labeled LB serves as a reverse proxy, and it will have to be able to discover new nodes in your swarm dynamically.

Networks

For my very basic setup I created an encrypted overlay network for - initially - all my services.

docker network create --driver overlay --subnet 10.0.9.0/24 --opt encrypted my-network

Of course you can create specific networks for groups of services but that will depend on your requirements and use cases.

Communication between services

By default, Docker Swarm configures DNS service discovery for you, so you will be able to reach one service from another in your swarm via its name, as long as they share the same network.
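A sketch of what that looks like in practice - the service names, images, port and path here are all hypothetical:

# Two services, both attached to the same overlay network
docker service create --name api --network my-network my-registry/api
docker service create --name worker --network my-network my-registry/worker

# From inside any container of the worker service, the api service resolves by name
curl http://api:3000/health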

Registry

In my case, I configured a private - HTTP only - registry to serve my Docker images (see Configuring a local Docker Registry). Since it won’t be public and I only need it available in the virtual network of my instances, I don’t see any reason why I should configure TLS on it.
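One thing to keep in mind: the Docker daemon refuses plain-HTTP registries by default, so every node has to list the registry under insecure-registries. A minimal sketch, assuming a hypothetical registry host:

# On every node: mark the private registry as insecure (plain HTTP)
# (merge this into any existing daemon.json instead of overwriting it)
echo '{ "insecure-registries": ["registry.internal:5000"] }' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker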

Autoscaling groups

It’s very important to have redundancy of managers in your swarm, because if you have only one manager and something goes wrong, you will lose the entire swarm. That being said, it’s also important to have an odd number of managers due to the nature of the Raft consensus algorithm that Swarm uses to elect a leader (read more in this section). You should also back up the swarm state regularly.
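A minimal sketch of that backup, assuming the default data root (/var/lib/docker), which is where Swarm keeps its Raft state:

# On a manager: stop the daemon so the Raft state isn't written mid-backup
sudo systemctl stop docker
sudo tar -czf /tmp/swarm-state-$(date +%F).tar.gz -C /var/lib/docker swarm
sudo systemctl start docker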

Ideally you will have two autoscaling groups, one for your managers and another one for your workers.

For small clusters I see no problem in letting the managers run service instances, but in big clusters you should have dedicated managers that run no workloads - see how to do it here.
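In short, draining a manager keeps it managing while keeping workloads off it (the hostname is a placeholder):

# The node keeps its manager role but runs no service tasks;
# existing tasks are rescheduled onto other nodes
docker node update --availability drain <manager-hostname>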

Conclusion

This blog post focuses on architecture decisions, but the development and deployment pipelines are two other important parts that are missing. I may write something about them once I’m more familiar with them.

If you have configured your swarm in a different way, or I’m missing some neat features, tell me in the comments!

Thanks to Martin and Ezequiel for reviewing this post and helping me to tune up the diagram.