High-Availability Rancher for the Home Lab
After playing with Rancher some on a small VM I decided I wanted to up my game and try a larger cluster. Having recently picked up a chunky new server I finally have the space to do just that!
Preparation - DNS
I decided that I wanted all of the various applications I was going to host on this cluster to live under a single subdomain. Since routing to specific applications will be handled by the Rancher cluster, and the cluster will be behind a load balancer, all I need to do is add a new wildcard CNAME record which points *.rancher to my load balancer at rancher.example.com. I also added A and PTR records for the load balancer and each of the three nodes I would be creating to my DNS server so I can use names instead of IPs in my configuration.
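For reference, the forward records look roughly like this in a BIND-style zone file. This is just a sketch: the addresses are placeholders, and the PTR records live in the matching reverse zone.
; example.com zone (illustrative addresses)
rancher     IN  A      192.168.1.10   ; HAProxy load balancer
*.rancher   IN  CNAME  rancher        ; wildcard for apps hosted on the cluster
rancher-1   IN  A      192.168.1.11   ; cluster node 1
rancher-2   IN  A      192.168.1.12   ; cluster node 2
rancher-3   IN  A      192.168.1.13   ; cluster node 3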
Preparation - HAProxy
I went with a CentOS 8 VM image because that was what was handy and set up a new instance with 2 CPU cores and 1GB of RAM. Since I’m only going to be installing HAProxy and will be keeping minimal logs I kept the disk size at the default 10GB. I also went ahead and enabled Cockpit since it is now available by default and the web interface can be handy.
$ systemctl enable --now cockpit.socket
Since this is the same template image as my previous FreeIPA upgrade, it also didn’t come with firewalld enabled, so that needed fixing as well.
$ dnf install -y firewalld
$ systemctl enable firewalld
$ systemctl start firewalld
I also need to allow a few services through the firewall so both Cockpit and HAProxy are able to do what I want.
$ firewall-cmd --permanent --add-service={ssh,cockpit,http,https}
$ firewall-cmd --reload
Now we can go ahead and install HAProxy which will be acting as our Layer 4 load balancer for the cluster.
$ dnf install -y haproxy
The configuration to get Rancher working is surprisingly easy and the underlying architecture is described here in the documentation. We will only be working with things at the TCP level so we can let Rancher handle SSL termination and all of that work. All we need to do is make sure that ports 80 and 443 are mapped back to our cluster’s control plane nodes.
# ----------
# /etc/haproxy/haproxy.cfg
# ----------
global
    maxconn 4096

defaults
    balance roundrobin
    option redispatch
    timeout connect 5s
    timeout queue 5s
    timeout client 36000s
    timeout server 36000s

frontend rancher_http
    bind *:80
    mode tcp
    default_backend rancher_http_backend

frontend rancher_https
    bind *:443
    mode tcp
    default_backend rancher_https_backend

backend rancher_http_backend
    mode tcp
    option tcp-check
    server rancher-1 rancher-1.example.com:80 check
    server rancher-2 rancher-2.example.com:80 check
    server rancher-3 rancher-3.example.com:80 check

backend rancher_https_backend
    mode tcp
    option tcp-check
    server rancher-1 rancher-1.example.com:443 check
    server rancher-2 rancher-2.example.com:443 check
    server rancher-3 rancher-3.example.com:443 check
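Before going any further it doesn’t hurt to have HAProxy validate the file. This is just a sanity check, assuming the config lives at the default path shown above.
$ haproxy -c -f /etc/haproxy/haproxy.cfg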
Now we can enable the HAProxy service so that it comes up automatically if we ever need to reboot our load balancer.
$ systemctl enable haproxy
We actually don’t want to start HAProxy yet because none of our nodes are running and its liveness checks will fail, but when we’re ready we will use the following.
$ systemctl start haproxy
Preparation - VMs
Just like in my previous post about Rancher, we’re going to be combining the Kubernetes control plane, etcd nodes, and workload hosting onto the same servers. However, this time we’re going to be beefing up each server and deploying three of them. We could choose to have smaller control plane nodes with additional worker nodes, but a three-master, all-in-one approach seemed like a simple and effective test.
I increased the VM resources from 2 CPUs and 4GB of RAM to 4 CPUs and 8GB of RAM per host and included a 100GB thin provisioned disk for each as well. With three of those prepped and loaded with the RancherOS ISO I was ready to boot them up and start installing.
Install - RancherOS
The install followed the same pattern as before but I went ahead and created a separate cloud-config.yml file for each of the nodes so I wouldn’t need to edit anything in place. I also recently learned that qemu-guest-agent does in fact come bundled with RancherOS, so I enabled that by adding the following under the rancher key in the cloud-config.
# rancher-1-cloud-config.yml
rancher:
  ...
  services_include:
    qemu-guest-agent: true
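With a per-node cloud-config written out, the install itself is the usual single command run from the live ISO. A sketch, assuming the cloud-config is reachable from the node and the target disk is /dev/sda:
$ sudo ros install -c rancher-1-cloud-config.yml -d /dev/sda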
Once the install finished, I detached the ISOs and rebooted the nodes to get them ready for installing Kubernetes.
Install - Kubernetes With RKE
Again, the install for a multinode configuration was very similar to the single node I had tried before. I created a new cluster configuration file which included all of my nodes:
# rancher-cluster.yml
ssh_agent_auth: true
nodes:
  - address: rancher-1.example.com
    user: rancher
    role: [controlplane, worker, etcd]
  - address: rancher-2.example.com
    user: rancher
    role: [controlplane, worker, etcd]
  - address: rancher-3.example.com
    user: rancher
    role: [controlplane, worker, etcd]
services:
  etcd:
    snapshot: true
    creation: 6h
    retention: 24h
Then I installed Kubernetes with rke up --config ./rancher-cluster.yml. The ssh_agent_auth line is only required because I have my SSH key loaded in an HSM, and the only way to get hold of it from an automated process is through ssh-agent. Even more impressive this time is that everything came up correctly during the first run, no network issues at all!
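If you’re in a similar situation, the workflow is just to make sure an agent is running and already holds the key before invoking RKE. A rough sketch; the key path is a placeholder and will differ if your key lives on an HSM or smart card:
$ eval $(ssh-agent)
$ ssh-add ~/.ssh/id_ed25519   # or however your HSM-backed key is exposed to the agent
$ rke up --config ./rancher-cluster.yml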
Install - Rancher
Installing Rancher itself was also pretty easy since I went with certificates that I had generated from my personal CA. More detail can be found in my previous post but the simple version is to create a configuration YAML file for Rancher which sets the cluster hostname to the DNS name of your HAProxy load balancer.
# rancher-values.yml
hostname: rancher.example.com
ingress:
  tls:
    source: secret
privateCA: true
Then install as described previously with Helm, passing in the values you defined.
$ helm install rancher rancher-latest/rancher \
    --namespace cattle-system \
    --values ./rancher-values.yml
Create your TLS secrets and wait for the deploy to finish. Once at least the rancher pods are running you can start HAProxy as well. You can check by running kubectl --namespace cattle-system get pods and looking for rancher.
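For completeness, creating those secrets looks roughly like this. The secret names are the ones the Rancher chart documentation uses for a private CA setup; the certificate, key, and CA file names are assumptions about what your CA handed you.
$ kubectl --namespace cattle-system create secret tls tls-rancher-ingress \
    --cert=tls.crt --key=tls.key
$ kubectl --namespace cattle-system create secret generic tls-ca \
    --from-file=cacerts.pem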
It literally could not have gone smoother.
Storage - NFS
This time around I also wanted to add a storage class which would permit pod migration between my different hosts. I already have several NFS servers available and there is a handy Helm chart for an NFS storage class. I provided the information about my NFS exports as parameters to the chart and installed it.
# nfs-values.yml
replicaCount: 3
storageClass:
  name: example-nfs-client
nfs:
  server: nfs.example.com
  path: /var/nfs/k8s-data
$ helm install nfs-client stable/nfs-client-provisioner --values ./nfs-values.yml
This install method puts the provisioner in the global namespace instead of scoping it under a specific project namespace. It would probably be more secure to do this on a per-project basis, but global is fine for testing purposes.
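Any workload can now request NFS-backed storage by referencing the new class. A minimal sketch of a PersistentVolumeClaim; the claim name and size are placeholders:
# example-pvc.yml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-data
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: example-nfs-client
  resources:
    requests:
      storage: 1Gi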
Maintenance - Shutdown and Reboot
The last thing I wanted to test before I called this experiment a success was shutting down and rebooting the cluster. As it turns out, that is also pretty easy to do, but I did run into a small gotcha with HAProxy. I used the Rancher web UI to accomplish all of these tasks, but you could just as easily use kubectl if you wanted.
First I gracefully drained two of my nodes before shutting them down, allowing all of my workloads to migrate onto the remaining node. The hope here was that it would allow all of the etcd pods to migrate and I wouldn’t have issues when it came back up. Once the first two nodes were down I performed a graceful shutdown on the last node before shutting down the HAProxy VM as well.
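The kubectl equivalent of the drain step would look something like this; a sketch, assuming the node names match the addresses from the RKE cluster file:
$ kubectl drain rancher-2.example.com --ignore-daemonsets --delete-local-data
$ kubectl drain rancher-3.example.com --ignore-daemonsets --delete-local-data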
I began my restart procedure by rebooting the HAProxy VM and then rebooting the RancherOS VMs in reverse order. As it turned out, this order was a mistake. When HAProxy started up it couldn’t see Rancher running on port 80 or 443, so it actually failed to come up at all. The problem is that the cattle-node-agent and cattle-cluster-agent pods were also pinging rancher.example.com, AKA HAProxy, trying to get a heartbeat.
The solution ended up being to wait for the rancher pods to become ready and then restart HAProxy. Once the other pods overcame their crash loop backoff timers they were able to get the heartbeat successfully and the system came back up.
From there all that I needed to do was uncordon the two drained nodes so they were
schedulable again and I was good to go!
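Again, the same thing can be done from the command line if you prefer; assuming the same node names as before:
$ kubectl uncordon rancher-2.example.com
$ kubectl uncordon rancher-3.example.com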
Conclusions
There’s still a lot of testing to do to see how fit for purpose this kind of cluster is for my workloads but the initial install was smooth and painless. Hopefully keeping it running is the same!