High-Availability Rancher for the Home Lab
After playing with Rancher some on a small VM I decided I wanted to up my game and try a larger cluster. Having recently picked up a chunky new server I finally have the space to do just that!
Preparation - DNS
I decided that I wanted all of the various applications I was going to host on this cluster to live under a single subdomain. Since routing to specific applications will be handled by the Rancher cluster, and the cluster will be behind a load balancer, all I need to do is add a new wildcard CNAME record which points *.rancher to my load balancer at rancher.example.com. I also added A and PTR records for the load balancer and each of the three nodes I would be creating to my DNS server so I can use names instead of IPs in my configuration.
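For reference, the forward records look roughly like this in a BIND-style zone file. This is just a sketch: the addresses are placeholders, and the PTR records live in the matching reverse zone.
; example.com zone (illustrative addresses)
rancher     IN  A      192.168.1.10   ; HAProxy load balancer
*.rancher   IN  CNAME  rancher        ; wildcard for apps hosted on the cluster
rancher-1   IN  A      192.168.1.11   ; cluster node 1
rancher-2   IN  A      192.168.1.12   ; cluster node 2
rancher-3   IN  A      192.168.1.13   ; cluster node 3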
Preparation - HAProxy
I went with a CentOS 8 VM image because that was what was handy and set up a new instance with 2 CPU cores and 1GB of RAM. Since I’m only going to be installing HAProxy and will be keeping minimal logs I kept the disk size at the default 10GB. I also went ahead and enabled Cockpit since it is now available by default and the web interface can be handy.
$ systemctl enable --now cockpit.socket
Since this is the same template image as my previous FreeIPA upgrade, it also didn’t come with firewalld enabled, so that needed fixing as well.
$ dnf install -y firewalld
$ systemctl enable firewalld
$ systemctl start firewalld
I also need to allow a few services through the firewall so both Cockpit and HAProxy are able to do what I want.
$ firewall-cmd --permanent --add-service={ssh,cockpit,http,https}
$ firewall-cmd --reload
Now we can go ahead and install HAProxy which will be acting as our Layer 4 load balancer for the cluster.
$ dnf install -y haproxy
The configuration to get Rancher working is surprisingly easy and the underlying architecture is described here in the documentation. We will only be working with things at the TCP level so we can let Rancher handle SSL termination and all of that work. All we need to do is make sure that ports 80 and 443 are mapped back to our cluster’s control plane nodes.
# ----------
# /etc/haproxy/haproxy.cfg
# ----------
global
    maxconn 4096

defaults
    balance roundrobin
    option redispatch
    timeout connect 5s
    timeout queue 5s
    timeout client 36000s
    timeout server 36000s

frontend rancher_http
    bind *:80
    mode tcp
    default_backend rancher_http_backend

frontend rancher_https
    bind *:443
    mode tcp
    default_backend rancher_https_backend

backend rancher_http_backend
    mode tcp
    option tcp-check
    server rancher-1 rancher-1.example.com:80 check
    server rancher-2 rancher-2.example.com:80 check
    server rancher-3 rancher-3.example.com:80 check

backend rancher_https_backend
    mode tcp
    option tcp-check
    server rancher-1 rancher-1.example.com:443 check
    server rancher-2 rancher-2.example.com:443 check
    server rancher-3 rancher-3.example.com:443 check
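Before going any further it doesn’t hurt to have HAProxy validate the file. This is just a sanity check, assuming the config lives at the default path shown above.
$ haproxy -c -f /etc/haproxy/haproxy.cfg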
Now we can enable the HAProxy service so that it comes up automatically if we ever need to reboot our load balancer.
$ systemctl enable haproxy
We actually don’t want to start HAProxy yet because none of our nodes are running and its liveness checks will fail, but when we’re ready we will use the following.
$ systemctl start haproxy
Preparation - VMs
Just like in my previous post about Rancher, we’re going to be combining the Kubernetes control plane, etcd nodes, and workload hosting onto the same servers. However, this time we’re going to be beefing up each server and deploying three of them. We could choose to have smaller control plane nodes with additional worker nodes, but a three-master, all-in-one approach seemed like a simple and effective test.
I increased the VM resources from 2 CPUs and 4GB of RAM to 4 CPUs and 8GB of RAM per host and included a 100GB thin provisioned disk for each as well. With three of those prepped and loaded with the RancherOS ISO I was ready to boot them up and start installing.
Install - RancherOS
The install followed the same pattern as before but I went ahead and created a separate cloud-config.yml file for each of the nodes so I wouldn’t need to edit anything in place. I also recently learned that qemu-guest-agent does in fact come bundled with RancherOS, so I enabled that by adding the following under the rancher key in the cloud-config.
# rancher-1-cloud-config.yml
rancher:
  ...
  services_include:
    qemu-guest-agent: true
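With a per-node cloud-config written out, the install itself is the usual single command run from the live ISO. A sketch, assuming the cloud-config is reachable from the node and the target disk is /dev/sda:
$ sudo ros install -c rancher-1-cloud-config.yml -d /dev/sda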
Once the install finished, I detached the ISOs and rebooted the nodes to get them ready for installing Kubernetes.
Install - Kubernetes With RKE
Again, the install for a multinode configuration was very similar to the single node I had tried before. I created a new cluster configuration file which included all of my nodes:
# rancher-cluster.yml
ssh_agent_auth: true
nodes:
  - address: rancher-1.example.com
    user: rancher
    role: [controlplane, worker, etcd]
  - address: rancher-2.example.com
    user: rancher
    role: [controlplane, worker, etcd]
  - address: rancher-3.example.com
    user: rancher
    role: [controlplane, worker, etcd]
services:
  etcd:
    snapshot: true
    creation: 6h
    retention: 24h
Then I installed Kubernetes with rke up --config ./rancher-cluster.yml. The ssh_agent_auth line is only required because I have my SSH key loaded in an HSM, and the only way to get hold of it from an automated process is through ssh-agent. Even more impressive this time is that everything came up correctly during the first run, no network issues at all!
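If you’re in a similar situation, the workflow is just to make sure an agent is running and already holds the key before invoking RKE. A rough sketch; the key path is a placeholder and will differ if your key lives on an HSM or smart card:
$ eval $(ssh-agent)
$ ssh-add ~/.ssh/id_ed25519   # or however your HSM-backed key is exposed to the agent
$ rke up --config ./rancher-cluster.yml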
Install - Rancher
Installing Rancher itself was also pretty easy since I went with certificates that I had generated from my personal CA. More detail can be found in my previous post but the simple version is to create a configuration YAML file for Rancher which sets the cluster hostname to the DNS name of your HAProxy load balancer.
# rancher-values.yml
hostname: rancher.example.com
ingress:
  tls:
    source: secret
privateCA: true
Then install as described previously with Helm, passing in the values you defined.
$ helm install rancher rancher-latest/rancher \
    --namespace cattle-system \
    --values ./rancher-values.yml
Create your TLS secrets and wait for the deploy to finish. Once at least the rancher pods are running you can start HAProxy as well. You can check by running kubectl --namespace cattle-system get pods and looking for rancher.
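For completeness, creating those secrets looks roughly like this. The secret names are the ones the Rancher chart documentation uses for a private CA setup; the certificate, key, and CA file names are assumptions about what your CA handed you.
$ kubectl --namespace cattle-system create secret tls tls-rancher-ingress \
    --cert=tls.crt --key=tls.key
$ kubectl --namespace cattle-system create secret generic tls-ca \
    --from-file=cacerts.pem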
It literally could not have gone smoother.
Storage - NFS
This time around I also wanted to add a storage class which would permit pod migration between my different hosts. I already have several NFS servers available and there is a handy Helm chart for an NFS storage class. I provided the information about my NFS exports as parameters to the chart and installed it.
# nfs-values.yml
replicaCount: 3
storageClass:
  name: example-nfs-client
nfs:
  server: nfs.example.com
  path: /var/nfs/k8s-data
$ helm install nfs-client stable/nfs-client-provisioner --values ./nfs-values.yml
This install method puts the provisioner in the global namespace instead of scoping it under a specific project namespace. It would probably be more secure to do this on a per-project basis, but global is fine for testing purposes.
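Any workload can now request NFS-backed storage by referencing the new class. A minimal sketch of a PersistentVolumeClaim; the claim name and size are placeholders:
# example-pvc.yml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-data
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: example-nfs-client
  resources:
    requests:
      storage: 1Gi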
Maintenance - Shutdown and Reboot
The last thing I wanted to test before I called this experiment a success was shutting down and rebooting the cluster. As it turns out, that is also pretty easy to do, but I did run into a small gotcha with HAProxy. I used the Rancher web UI to accomplish all of these tasks, but you could just as easily use kubectl if you wanted.
First I gracefully drained two of my nodes before shutting them down, allowing all of my workloads to migrate onto the remaining node. The hope here was that it would allow all of the etcd pods to migrate and I wouldn’t have issues when it came back up. Once the first two nodes were down I performed a graceful shutdown on the last node before shutting down the HAProxy VM as well.
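The kubectl equivalent of the drain step would look something like this; a sketch, assuming the node names match the addresses from the RKE cluster file:
$ kubectl drain rancher-2.example.com --ignore-daemonsets --delete-local-data
$ kubectl drain rancher-3.example.com --ignore-daemonsets --delete-local-data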
I began my restart procedure by rebooting the HAProxy VM and then rebooting the RancherOS VMs in reverse order. As it turned out, this order was a mistake. When HAProxy started up it couldn’t see Rancher running on port 80 or 443, so it actually failed to come up at all. The problem is that the cattle-node-agent and cattle-cluster-agent pods were also pinging rancher.example.com, AKA HAProxy, trying to get a heartbeat.
The solution ended up being to wait for the rancher pods to become ready and then restart HAProxy. Once the other pods overcame their crash loop backoff timers they were able to get the heartbeat successfully and the system came back up.
From there all that I needed to do was uncordon the two drained nodes so they were
schedulable again and I was good to go!
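Again, the same thing can be done from the command line if you prefer; assuming the same node names as before:
$ kubectl uncordon rancher-2.example.com
$ kubectl uncordon rancher-3.example.com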
Conclusions
There’s still a lot of testing to do to see how fit for purpose this kind of cluster is for my workloads but the initial install was smooth and painless. Hopefully keeping it running is the same!