Deploying 100+ servers with GitHub Actions

Last year we decided to replace our Ubuntu servers with NixOS. We were still on Ubuntu 14.04, so we also had to upgrade all the services and tools, including PHP, MySQL, HAProxy, etc.

As you can imagine, this was a lot of work, and you might be asking yourself, “Why didn’t you just upgrade Ubuntu?” It’s a good question, and one we asked ourselves several times. We kept coming to the same answer: NixOS is declarative and reproducible, and it’s hard to break (as opposed to Ubuntu, where you often break something without even noticing). To learn more about NixOS, I highly encourage you to visit the project’s site.

But the story doesn’t end there (or this would be a very short and dull blog post). Because of how our infrastructure is built, we have a lot of servers with minimal resources. We are talking about 1–2 GB of RAM and 1 CPU (e.g., AWS t2.small). And, of course, we want to utilize our servers fully and not leave money on the table. On average, each server runs ~15 services (e.g., Telegraf, MySQL, HAProxy, Apache, Varnish, PHP-FPM, Redis, and many of our internal tools), so very little memory is left free.

In NixOS, whenever you change something (e.g., the HAProxy configuration, or adding or removing a user), you need to rebuild the system. This takes a lot of memory (and CPU, but that is generally not an issue), which is a problem when you have almost none to spare. And what happens if you try to rebuild NixOS without enough memory? Nothing. The OS kills the build process (nix-build) as soon as the machine runs out of memory (the Out-Of-Memory Killer).

But how can we change our system if every change needs a rebuild and we can’t rebuild because we don’t have enough memory? It turns out NixOS can be built on a different machine, which then pushes the needed packages and activation scripts to the server.

We use GitHub Actions to build NixOS and then push all the necessary packages and activation scripts to the server. This sounds a lot harder than it is. It’s basically one command, plus some preparation and cleanup: nixos-rebuild switch --target-host root@$SERVER_IP --build-host localhost -I nixpkgs=$NIXPKGS -I nixos-config=configuration.nix
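To make the command above easier to reuse across servers, it can be wrapped in a small shell function. This is only a sketch of how such a step might look, not our actual workflow: the function name deploy, the DRY_RUN switch, and the example paths are made up for illustration, and the preparation and cleanup steps (for instance, setting up the SSH key for root) are left out.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: deploy one server with a remote-target rebuild.
# Arguments: server IP, path to nixpkgs, path to that server's configuration.nix.
# With DRY_RUN=1 the command is only printed, which is handy for reviewing a change.
deploy() {
  server_ip="$1"
  nixpkgs="$2"
  config="$3"

  # Build on the machine running this script (the CI runner),
  # then copy the result to the server and activate it there.
  set -- nixos-rebuild switch \
    --target-host "root@${server_ip}" \
    --build-host localhost \
    -I "nixpkgs=${nixpkgs}" \
    -I "nixos-config=${config}"

  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Example (hypothetical values):
#   DRY_RUN=1 deploy 10.0.0.5 /path/to/nixpkgs hosts/web1/configuration.nix
```

Running it with DRY_RUN=1 first shows exactly what would be executed against which server before anything is switched.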

Unfortunately, we still had a problem. Rebuilding all those packages takes a lot of time, and waiting 10+ minutes to apply a change to a server is too long. So what could we do? Cache to the rescue. Well, to be more specific, Cachix to the rescue. Cachix is a fantastic service for caching Nix build results and sharing them between machines, so a package that has been built once can be downloaded instead of rebuilt. It’s super fast and very reliable. So far, we haven’t had any issues with it whatsoever.
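One way to plug Cachix into a build step is sketched below. It is an assumption about how this could be wired up rather than a description of our exact setup: the function name with_cache and the cache name example-cache are hypothetical, and pushing requires an auth token (typically supplied through the CACHIX_AUTH_TOKEN environment variable). The sketch relies on two Cachix CLI commands: cachix use, which configures Nix to download from the cache, and cachix watch-exec, which runs a command and uploads the store paths it produces.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: run a build command with a Cachix cache attached.
# Arguments: cache name, followed by the command to run.
# With DRY_RUN=1 the wrapped command is only printed.
with_cache() {
  cache="$1"
  shift

  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "cachix watch-exec ${cache} -- $*"
  else
    # Pull already-built packages from the cache...
    cachix use "$cache"
    # ...and push anything newly built while the command runs.
    cachix watch-exec "$cache" -- "$@"
  fi
}

# Example (hypothetical cache name):
#   with_cache example-cache nixos-rebuild build -I nixos-config=configuration.nix
```

With a warm cache, most of a rebuild turns into downloads, which is where the time savings come from.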

Because every server’s configuration lives in a GitHub repository, we need to open a PR to change it. The PR needs to get approved and then merged. After that, everything is automated: a GitHub Actions workflow is triggered, and the servers affected by the change in the PR are rebuilt. So no more hacking via SSH or running Ansible tasks that can break the whole server. Knowing that all servers have the same configuration, and that this configuration lives in a git repository that can’t be changed without approval, can really improve your sleep. I highly recommend it.
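How the affected servers are picked depends on how the repository is laid out. As a rough illustration only, assume a hypothetical layout where each server has its own directory, hosts/<name>/; the list of servers to rebuild can then be derived from the files changed in the PR. The function name affected_hosts and the directory layout are both assumptions made for this sketch.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: read changed file paths on stdin (one per line)
# and print the names of the servers whose directories were touched.
# Assumes a layout of hosts/<name>/<files>; anything else is ignored.
affected_hosts() {
  awk -F/ '$1 == "hosts" && NF >= 3 { print $2 }' | sort -u
}

# Example: feed it the files changed between two commits
# (BASE_SHA and HEAD_SHA would come from the CI environment):
#   git diff --name-only "$BASE_SHA" "$HEAD_SHA" | affected_hosts
```

A change to a file shared by every server would not match this pattern, so a real workflow would need an extra rule for that case, such as rebuilding all servers.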

If there are two things you should take away from this blog post, they are:

Always rebuild NixOS remotely.
Use Cachix to speed up the rebuilds.