So this is going to be one part debugging/tutorial and one part rant/request for help from the greater world. Hopefully it’s helpful to the next poor soul who decides they want to try running a home lab on oVirt using whatever they have to hand.

The Inciting Incident

This particular saga actually comes on the heels of another weird oVirt error that I was never able to figure out, and in the middle of a reinstall of the new node. What triggered the whole experiment: after an upgrade from oVirt Node 4.3.7 to 4.3.8 I was modifying firewall rules through the Cockpit interface and, while saving them, firewalld must have crashed or something. The effect was that all of my rules were wiped out and the firewall was left scrambled. Getting the rules back wasn’t too hard since I had physical access to the machine, but some part of the auto-configured oVirt firewall rules must have been lost and never came back correctly. This manifested as VMs being unable to get DHCP addresses when they came up on the host in question. The interesting part is that I could watch the DHCP server receive the request from the new VM, assign an IP, and ARP it back. But on the client end … nothing. Even more bizarre, VMs configured with static IPs didn’t seem to be bothered by it at all: their services behaved as expected and they were fully able to access the rest of my home network. I even tried a reinstall/reconfigure through the oVirt management interface, but no dice. Since this was thankfully the new node, I went to the ultimate fallback: blow it away and reinstall from scratch. Sneak preview: this won’t be the only reinstall during this process.
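
If you ever hit the same symptom, watching both ends of the exchange with tcpdump is the quickest way to prove where the DHCP replies disappear. A minimal sketch, with the interface names as placeholders for whatever bridge or NIC your VM network actually rides on:

    # On the oVirt host: watch DHCP traffic on the bridge the VM network uses.
    # "ovirtmgmt" is a placeholder; substitute your VM network's bridge or NIC.
    tcpdump -ni ovirtmgmt port 67 or port 68

    # On the DHCP server: the same filter shows whether the OFFER/ACK ever leaves.
    tcpdump -ni eth0 port 67 or port 68

In my case the requests and replies were visible on the server side, which is what pointed the finger at something between the two boxes rather than at the DHCP server itself.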

The First Attempt

I started off fresh with a brand new copy of oVirt Node 4.3.8 burned onto a USB stick. I booted up the new machine and ran the install, being careful to configure only the single management interface with a static IP and leave the rest disabled. Once the install finished I rebooted and jumped onto the Cockpit interface for the new machine. I mounted an extra drive to use for NFS storage and configured it before going into the firewall configuration to open up the necessary ports. Everything seemed to be going swimmingly until I added a bonded adapter combining the remaining 3 NICs on my HP DL360p Gen8 into a single XORed interface. The bond itself completed without a hitch, but instead of staying inactive until an oVirt network was assigned to it, as I had seen previously, the bond immediately pulled a DHCP address and set itself up. I assumed this wouldn’t be a problem.
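
For context, Cockpit drives NetworkManager under the hood, so the bond it created is roughly equivalent to the nmcli sketch below. The eno2–eno4 names are placeholders for the three spare ports, and mode=balance-xor is the Mode 2 bond I keep referring to:

    # Roughly what the Cockpit bond dialog does behind the scenes, via nmcli.
    nmcli con add type bond con-name bond0 ifname bond0 bond.options "mode=balance-xor,miimon=100"
    nmcli con add type ethernet con-name bond0-port1 ifname eno2 master bond0
    nmcli con add type ethernet con-name bond0-port2 ifname eno3 master bond0
    nmcli con add type ethernet con-name bond0-port3 ifname eno4 master bond0

    # The part that surprised me: the bond came up with DHCP enabled.
    # To keep it from grabbing an address until oVirt claims it:
    nmcli con mod bond0 ipv4.method disabled ipv6.method ignore

With the benefit of hindsight (see the lessons at the end), letting oVirt create the bond instead of Cockpit is the safer route anyway.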

Next I hopped over to the oVirt management interface and added the new host. I told it to do a basic install and to go ahead and configure the host firewall as well. I skipped installing the self-hosted engine for the moment: while it seems possible to move the hosted engine into a new cluster using the CLI and maintenance mode as seen here, it isn’t possible if the two clusters use different CPU architectures, because there’s no way to tell the hosted engine to change from Intel virtualization to AMD or vice versa. It also isn’t possible to have a mixed-architecture cluster, as described here in the documentation. The cluster I had preconfigured for the new node (and any future Intel hosts) required two virtual networks to be configured, so when the base install finished my node was left in an unavailable state.

I went ahead and edited the host to assign the correct networks, but after editing I saw an error in the log that the system was unable to sync the host’s networks, and the host returned to unavailable. Going back into the configuration I clicked on “Save Network Configuration” and all hell broke loose. Almost instantly the node was knocked offline and I was only able to check on it through the physical terminal. While ip addr seemed to show everything OK and systemctl status didn’t show any failed services, I was unable to even SSH into the machine remotely.
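
For anyone else staring at a console in this state, these are the sort of checks worth running beyond ip addr and systemctl status. The unit names assume a stock oVirt Node install, where vdsmd and NetworkManager own the host networking:

    ip addr                        # interfaces and addresses
    ip route                       # did the default route survive the "save"?
    nmcli device status            # which connections NetworkManager thinks are active
    journalctl -u NetworkManager -u vdsmd --since "10 min ago"   # look for network sync errors

None of it pointed at an obviously failed service for me either, which is part of what made this so frustrating.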

Lacking a better plan I tried a reboot as well as fully removing the bond configuration from the server. Neither seemed to have any effect: I’d have a working machine for a few minutes post-reboot, then it would go offline and stay that way. Again, luckily, this is a secondary node, so reinstall we shall!

The Second Attempt

Round two started with the same OS install and configuration as before, all very straightforward. I added the node back into oVirt and used the oVirt management interface to create my NIC bond and assign it to the data network. This seemed to work correctly and the node happily reported as online. The next part of the process was to try adding firewall rules through Cockpit to allow NFSv3 and v4 traffic to the node. This is where it all went wrong again.

I tried adding all the firewall rules I needed (NFSv3, NFSv4, mountd, rpcbind) through Cockpit at the same time. However, when I clicked to save the rules everything except the NFSv4 rule was deleted, and I was promptly kicked out of all of my remote sessions. It seems clear now that adding multiple firewall rules at once through the Cockpit interface is the culprit. In attempt three we will try adding them one by one to see if we get a different result.
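
For what it’s worth, the same rules can be added one at a time from the shell with firewall-cmd, which sidesteps the Cockpit batching problem entirely. A sketch, assuming the default zone is the one your storage traffic lands in and using the service names that ship with recent firewalld:

    # Add each NFS-related service individually rather than in one batch.
    firewall-cmd --permanent --add-service=nfs        # NFSv4 (tcp/2049)
    firewall-cmd --permanent --add-service=nfs3       # NFSv3
    firewall-cmd --permanent --add-service=mountd
    firewall-cmd --permanent --add-service=rpc-bind
    firewall-cmd --reload

    # Verify what actually stuck:
    firewall-cmd --list-services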

The Third Attempt

The third attempt turned out to be the charm, with the same installation steps and configuration in oVirt. This time I added the firewall rules one by one instead of in a group, and Cockpit/firewalld was much happier with that approach, successfully adding all the needed rules. I booted up a new test VM and tried to get a DHCP lease, with no success. I double-checked the firewall rules and saw that IPv4 DHCP wasn’t explicitly allowed through the firewall, which was applied to my bond as well as the two VM networks. I reconfigured the firewall to allow it through and still no luck. I could see the requests going out, the DHCP server responding to them, and then nothing.
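
The firewall side of that change amounts to something like this, again assuming the relevant interfaces sit in the default zone (firewalld’s dhcp service opens udp/67):

    # Explicitly allow DHCP through the zone covering the bond and VM networks.
    firewall-cmd --permanent --add-service=dhcp
    firewall-cmd --reload
    firewall-cmd --list-services    # confirm "dhcp" now shows up

As the next paragraph shows, though, the firewall turned out not to be the real problem.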

In desperation I remembered that the last change I had made to the host before things started to go wrong was to implement that XOR bond I have been talking about on 3 of the 4 available NICs. Before that I was routing all of the VM data traffic through a single 1Gb NIC and I had hoped that by bonding them together I would be able to get better performance. I had even gone with the Mode 2 XOR bond instead of the recommended Mode 4 802.3ad bond method because I knew the switches I have deployed didn’t support the more advanced link aggregation. I had also checked the oVirt network documentation and made sure that Mode 2 was compatible with VM networks before setting it up. I went ahead and updated the host configuration to just use a single unbonded NIC for the data network and all of a sudden everything was happy again. Several test VMs picked up DHCP leases and were able to route traffic throughout my home network. It seems that something along the path between that server and the router does not agree with that bond method.
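
For reference, the kernel will tell you exactly which mode and hash policy a bond ended up with, which is worth checking before blaming the switch. A quick look, with bond0 standing in for whatever your bond device is named:

    # Inspect the live bond configuration the kernel is actually using.
    cat /proc/net/bonding/bond0
    # Key lines to look for:
    #   Bonding Mode: load balancing (xor)
    #   Transmit Hash Policy: layer2 (0)
    # With the default layer2 policy, the outgoing NIC is chosen by
    # (source MAC XOR destination MAC) modulo the number of slaves, which is
    # exactly the kind of MAC-flapping behavior a dumb switch can get confused by.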

Lessons Learned

  • You can’t mix Intel and AMD in a single cluster.
  • A self-hosted engine that starts on a particular architecture can never switch off of that architecture.
  • Don’t create a bond through the Cockpit interface; add it through oVirt or it won’t sync networks correctly.
  • Don’t try to add multiple firewall rules at once through the Cockpit web UI.
  • There is something hinky between a Mode 2 XOR bonded NIC configuration and the networking equipment/configuration I have that makes DHCP not work as it should.
  • When something breaks, always try undoing the thing you changed last as your first debugging step!

Conclusions

While oVirt gives a lot of bang for your buck as a homelab developer, it still has a few sharp edges. You will have a better time if you are running a unified CPU architecture or if you are deploying primarily to repurposed enterprise gear. While I’ve been able to make my old desktop work well enough for experiments, I’ve also had more than my fair share of hiccups when consumer hardware didn’t perform at the level expected. I still have a lot to learn, and as I substitute more enterprise-level gear into my setup I expect things will go more smoothly. The next big purchase is probably a managed switch capable of supporting port mirroring and other fun advanced settings.