Rescuing a CentOS 8 FreeIPA Cluster

TL;DR - I was lucky to have disk snapshots of all three IPA servers, taken on the same day about six months earlier, which let me roll back with minimal losses. While I was able to recover the filesystems, I was not able to get FreeIPA running again on the damaged systems. Moral of the story: have regular, well-tested backups for anything that you’d be sad to lose!

Recently my VM cluster suffered a few…unexpected power outages related to moving into our new house, which finally resulted in some disk corruption. Super unfortunately, this happened on my FreeIPA servers, which I use to run most services in the house. The problem manifested as machines that would get stuck at this message:

Probing EDD (edd=off to disable) ... ok

The machine in question would hang at that point and provide no other output. Some searching around seemed to indicate that it was an error further along in the boot process and the message itself was a bit of a red herring. I tried editing the GRUB configuration to add loglevel=7 and systemd.log_level=debug but neither gave me any additional information. With that little to go on I took a disk snapshot in oVirt and resolved to try and fix the problem.

Some more searching suggested that an error at this point could be a corrupted initramfs and that I should try to use dracut to rebuild it. Unfortunately, since that requires a running system I needed to get a copy of CentOS up and running before I could do anything. I loaded up a CentOS 8 ISO into oVirt and attached it to one of the broken VMs. When it started I selected rescue mode from the CD options and asked it to check for Linux partitions.

On both machines the autodetect failed and I was forced to fall back to manual detection by poking around /dev. Pretty quickly I was able to identify /dev/vda1 as my primary partition and that it was an XFS filesystem by looking at another similar VM. I tried mounting the device to the mountpoint provided by the live environment:

mount /dev/vda1 /mnt/sysimage

However, I got an error that the mount operation failed because “The structure needs cleaning.” This seems to be the XFS error for a corrupted filesystem, so I tried xfs_repair /dev/vda1 on both VMs to see if that would help. On one the command succeeded and reported that the filesystem was fixed, while on the other it reported that there were still journal entries that needed to be replayed, or there might be additional corruption. I tried to mount the volume again and got the same error, so I had to resort to xfs_repair -L /dev/vda1, which zeroes the log, discarding any journal entries that have not been replayed, and then attempts the repair anyway. That did get the repair to finish, but I wasn’t sure if the files would be ok afterwards.
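The escalation above (plain repair, then a mount to replay the log, then -L as a last resort) can be sketched as a small script. This is only a rough illustration under a few assumptions: the device and mountpoint are the ones from this post, and DRY_RUN, run, and repair_xfs are hypothetical names made up for the sketch. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the repair escalation. DEV and MNT are the values used in this
# post; DRY_RUN is a hypothetical safety switch (1 = print only).
DEV="${DEV:-/dev/vda1}"
MNT="${MNT:-/mnt/sysimage}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command, and execute it only when DRY_RUN is not 1.
    echo "+ $*"
    [ "$DRY_RUN" = "1" ] || "$@"
}

repair_xfs() {
    # 1. Plain repair. xfs_repair refuses to proceed if the log is dirty.
    run xfs_repair "$DEV" && return 0
    # 2. A successful mount replays the log; unmount and repair again.
    if run mount -t xfs "$DEV" "$MNT"; then
        run umount "$MNT"
        run xfs_repair "$DEV" && return 0
    fi
    # 3. Last resort: zero the log, losing any unreplayed metadata changes.
    run xfs_repair -L "$DEV"
}
```

After sourcing this, calling repair_xfs prints the first step; setting DRY_RUN=0 beforehand makes it actually run the commands. Since -L throws away pending metadata changes, taking a snapshot of the disk before that step seems wise.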

With that done, I decided I would try and patch up the initramfs while I was still in the rescue environment. For dracut to work properly inside a chroot, I first needed to mount the repaired filesystem along with the proc, sys, and dev pseudo-filesystems.

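cd /
# Mount the repaired root filesystem, then the pseudo-filesystems dracut needs
mount -t xfs /dev/vda1 /mnt/sysimage
mount -t proc proc /mnt/sysimage/proc
mount -t sysfs sys /mnt/sysimage/sys
mount -o bind /dev /mnt/sysimage/dev
chroot /mnt/sysimage
# Inside the chroot: rebuild every initramfs and regenerate the GRUB config
dracut --regenerate-all -f && grub2-mkconfig -o /boot/grub2/grub.cfg
exit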

That took a few minutes to finish but didn’t throw any errors so I decided to try and boot the machine back up. A few minutes later and success! The machine booted and I could log in without any issue. However, when poking around at system status with systemctl status I saw two failed units on both machines. I ran systemctl list-units --state=failed to see what was the matter and saw, tragically, that it was dirsrv and ipa for my home domain. I tried to restart the IPA service with systemctl restart ipa but it failed with log messages about internal database corruption. The only recommendation I could find was to try and reinitialize the database from the one surviving server and hope for the best.
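For digging into failed units like these, a small helper can pull the unit names out of systemctl's output and show the recent journal lines for each one. This is a rough sketch rather than anything I ran at the time: failed_units and show_failed_logs are hypothetical helper names, and it assumes the --plain --no-legend output format, where the unit name is the first column.

```shell
# failed_units: read `systemctl list-units --plain --no-legend` output on
# stdin and print just the unit names (assumed to be the first column).
failed_units() {
    awk 'NF { print $1 }'
}

# show_failed_logs: print the last 50 journal lines from the current boot
# for every failed unit. Only meaningful on a systemd host.
show_failed_logs() {
    systemctl list-units --state=failed --plain --no-legend |
        failed_units |
        while IFS= read -r unit; do
            echo "== $unit =="
            journalctl -b -u "$unit" -n 50 --no-pager
        done
}
```

Running show_failed_logs on the broken machine would print a section per failed unit, which is a quicker way to find messages like the database corruption errors than checking each service by hand.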

ipa-replica-manage re-initialize --from ipa.example.com

This initially failed because the server could not look up its own hostname in DNS. I guessed that this was because the IPA servers were configured to use themselves as DNS servers, but since IPA wasn’t running there was no DNS service to reference. Checking /etc/resolv.conf I discovered I was correct, as the only nameserver listed was 127.0.0.1. I modified the DNS nameservers to include the one remaining IPA server and tried the command again. Again it failed, but this time because it could not connect to LDAPS on the local host. It is a very strange design choice to make the only way to recover from database corruption be to already have the service up and running. It makes the process a catch-22 where you need a stable database on a machine to recover from an unstable database on the same machine.
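The resolv.conf change mentioned above can be sketched as a small function that puts the surviving server first in the nameserver list. This is a minimal sketch with assumptions: add_nameserver is a hypothetical helper name, and 192.0.2.10 in the usage note is a placeholder documentation address standing in for the real IP of the remaining IPA server.

```shell
# add_nameserver FILE IP: make IP the first nameserver entry in FILE.
# Does nothing if an identical "nameserver IP" line is already present.
add_nameserver() {
    file="$1"
    ip="$2"
    grep -qxF "nameserver $ip" "$file" && return 0
    tmp="$(mktemp)" || return 1
    # Write the new entry followed by the old contents, then copy the result
    # back in place so the original file keeps its owner and permissions.
    { echo "nameserver $ip"; cat "$file"; } > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

Usage would look like add_nameserver /etc/resolv.conf 192.0.2.10, run as root. If NetworkManager manages the file on your system it may overwrite the change later, so treat this as a temporary fix for the duration of the recovery.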

The end result was that I had to revert all of the IPA machines to the disk snapshots I had taken of them six months earlier. Thankfully there were minimal differences between the configuration then and the configuration now, so I should be able to replicate everything pretty easily. After reverting, I ran updates on all three machines before shutting them down and taking another snapshot as a backup, and that was that.