Ansible and Docker Swarm on a Raspberry Pi Homelab
The home setup moved from scattered services to a small Raspberry Pi cluster that is automated end to end. The goal was simple: survive SD card failures, deploy consistently, and stop rebuilding the same things by hand. Along the way, it also became a crash course in automation and container orchestration.
Hardware
- 8 × Raspberry Pi 4 (8 GB)
- 1 × Raspberry Pi 5 with a 2 TB NVMe drive on PCIe
- PoE HATs on every Pi and an 8-port PoE switch
- A compact 3D-printed rack
The Pi 5 hosts storage. Early attempts with USB-attached drives on the Pi 4s were slow and fragile, so the NVMe-based Pi 5 became the central store. Cooling is handled by the PoE HAT fans plus small heatsinks.
Automation with Ansible
Ansible drives the entire lifecycle, from initial Pi configuration to service deployment. The layout uses roles for services and targeted playbooks for common tasks. Secrets started in plaintext during early experiments and were migrated to Ansible Vault to avoid accidental disclosure. The structure is deliberately modular to keep changes manageable.
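A role-based playbook that pulls secrets from a Vault-encrypted vars file might look like the sketch below. All names here (the play name, host group, file paths, and role names) are illustrative assumptions, not the actual repository layout.

```yaml
# Hypothetical top-level playbook; hosts group, vars file path, and role
# names are placeholders for the real (private) repository contents.
- name: Deploy homelab services
  hosts: swarm_managers
  become: true
  vars_files:
    - vars/secrets.yml   # encrypted with: ansible-vault encrypt vars/secrets.yml
  roles:
    - traefik
    - gitea
    - home_assistant
```

Running it with `ansible-playbook site.yml --ask-vault-pass` (or a vault password file) keeps secrets out of plaintext while leaving the playbook itself readable.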
Orchestration with Docker Swarm
Docker Swarm was chosen because, at this scale, it offers the right balance of capability and simplicity. It schedules containers across nodes, keeps services running through node failures, and was quicker to learn than alternatives. Memory pressure on smaller boards pushed the cluster toward 8 GB models for headroom. Networking, volume semantics, and Traefik integration were the main learning curve.
Persistent storage with NFS
An NFS server on the Pi 5 exposes shared storage to the swarm. The performance is acceptable for home services, and the PCIe NVMe setup on the Pi 5 helps. The difficult part was permissions and user mapping between containers and the NFS server. After that was resolved, stateful services behaved as expected.
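In a Swarm stack file, NFS-backed volumes can be declared with the local driver's NFS options, so every node mounts the share the same way. A sketch, where the server address, export path, and NFS version are assumptions:

```yaml
# Hypothetical NFS volume definition; 192.168.1.10 and /srv/nfs/gitea
# stand in for the real Pi 5 address and export path.
volumes:
  gitea-data:
    driver: local
    driver_opts:
      type: nfs
      o: "addr=192.168.1.10,nfsvers=4,rw"
      device: ":/srv/nfs/gitea"
```

Matching the UID/GID the container runs as to the ownership of the exported directory is usually the key to the permission problems described above.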
Traefik as reverse proxy
Traefik routes traffic to services and manages HTTPS certificates via Let’s Encrypt. Early mistakes included requesting too many certificates (running into Let’s Encrypt rate limits) and exposing the dashboard. The current setup keeps the dashboard internal and uses the HTTP-01 challenge for validation. Labels on services let Traefik discover routes automatically.
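With the Swarm provider, Traefik reads labels from `deploy.labels` rather than the service level. A sketch of what the labels might look like; the hostname, entrypoint, and certificate resolver names are assumptions that must match the Traefik static configuration:

```yaml
# Hypothetical label set; git.example.lan, websecure, and letsencrypt are
# placeholders for the real host rule, entrypoint, and resolver names.
services:
  gitea:
    image: gitea/gitea:latest
    deploy:
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.gitea.rule=Host(`git.example.lan`)"
        - "traefik.http.routers.gitea.entrypoints=websecure"
        - "traefik.http.routers.gitea.tls.certresolver=letsencrypt"
        - "traefik.http.services.gitea.loadbalancer.server.port=3000"
```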
Services running
- Home Assistant for automation. USB hardware such as a Z-Wave stick requires pinning the container to the node with the attached device and storing configuration on NFS.
- Gitea for hosting personal repositories. Lightweight and straightforward to make persistent.
- Portainer for a convenient UI across the swarm, with agents on each node.
- Pi-hole, Uptime Kuma, and utility containers for backups and monitoring.
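Pinning Home Assistant to the node with the Z-Wave stick is done with a placement constraint. A minimal sketch, assuming a node hostname of `pi4-01` and an NFS-backed `ha-config` volume (both are placeholders):

```yaml
# Hypothetical placement constraint; pi4-01 stands in for whichever node
# physically holds the USB stick.
services:
  homeassistant:
    image: ghcr.io/home-assistant/home-assistant:stable
    volumes:
      - ha-config:/config   # configuration lives on the NFS share
    deploy:
      placement:
        constraints:
          - node.hostname == pi4-01   # only schedule on the node with the stick
```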
Deployment workflow
- Commit configuration changes to the Git repository.
- Run the appropriate Ansible playbook.
- Ansible updates configs and secrets, then deploys stack files to Swarm.
- Traefik updates routes based on labels.
A staging stack on the same swarm is used for larger changes, with smaller updates going straight to production. A README captures the process so it is easy to return after a gap.
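The "deploy stack files to Swarm" step can be expressed as a task using the `community.docker.docker_stack` module. A sketch, where the stack name and file path are illustrative:

```yaml
# Hypothetical deployment task from a role; /opt/stacks/gitea.yml is a
# placeholder for wherever Ansible renders the stack file on the manager.
- name: Deploy the gitea stack to the swarm
  community.docker.docker_stack:
    name: gitea
    state: present
    compose:
      - /opt/stacks/gitea.yml
```

Running the same task against a differently named stack (for example `gitea-staging`) is one simple way to keep the staging copy on the same swarm.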
Lessons learned
- Start simpler than you think. A complex design was cut back to essentials and iterated from there.
- Backups before features. Configuration loss early on led to prioritising backups. A containerised restic setup now backs up critical data locally and to cloud storage.
- Document as you go. What each service does, how networks are laid out, and common fixes are all written down.
- Test restores, not just backups. Recovery runs revealed missing files that were added to the backup set.
- Automate early. Anything left manual became a friction point later.
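The containerised restic setup mentioned above could be sketched roughly as follows; the image tag, mount paths, and repository location are all assumptions, and a real setup would add a schedule and a cloud repository target:

```yaml
# Hypothetical one-shot backup service; /srv/nfs and /mnt/backup are
# placeholder paths, and the password file would come from a Swarm secret.
services:
  backup:
    image: restic/restic:latest
    command: ["backup", "/data"]
    environment:
      RESTIC_REPOSITORY: /backups/restic-repo   # local target; cloud uses e.g. s3:...
      RESTIC_PASSWORD_FILE: /run/secrets/restic_password
    volumes:
      - /srv/nfs:/data:ro      # read-only mount of the data being backed up
      - /mnt/backup:/backups   # local repository location
```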
What is next
Better monitoring and alerting, basic CI/CD for configuration changes, and possibly evaluating Kubernetes as needs grow. The immediate priority is using the services effectively, not just improving the platform.
Thanks for reading. If you are running a small Swarm on Raspberry Pis and have a clean approach to NFS permissions or Traefik routing, share it so others can learn from it.