Rescuing a CentOS 8 FreeIPA Cluster
TL;DR - Luckily I had disk snapshots of all three IPA servers, taken on the same day about six months ago, which let me roll back with minimal losses. I was able to recover the filesystems, but I could not get FreeIPA running again on the damaged systems. Moral of the story: keep regular, well-tested backups of anything that you’d be sad to lose!
Recently my VM cluster suffered a few…unexpected power outages while we were moving into our new house, which eventually resulted in some disk corruption. Super unfortunately, the corruption landed on my FreeIPA servers, which I use to run most services in the house. The problem manifested as machines that would get stuck at this message:
Probing EDD (edd=off to disable) ... ok
The machine in question would hang at that point and provide no other output.
Some searching around seemed to indicate that it was an error further along in
the boot process and the message itself was a bit of a red herring. I tried
editing the GRUB configuration to add loglevel=7 and systemd.log_level=debug
but neither gave me any additional information. With that little to go on I took
a disk snapshot in oVirt and resolved to try and fix the problem.
Some more searching suggested that an error at this point could be a corrupted
initramfs and that I should try to use dracut to rebuild it. Unfortunately,
since that requires a running system I needed to get a copy of CentOS up and
running before I could do anything. I loaded up a CentOS 8 ISO into oVirt and
attached it to one of the broken VMs. When it started I selected rescue mode
from the CD options and asked it to check for Linux partitions.
On both machines the autodetect failed and I was forced to fall back to manual
detection by poking around /dev. Pretty quickly I was able to identify /dev/vda1
as my primary partition and that it was an XFS filesystem by looking
at another similar VM. I tried mounting the device to the mountpoint provided
by the live environment:
mount /dev/vda1 /mnt/sysimage
However I got an error that the mount operation failed because “The structure
needs cleaning.” This seems to be the XFS error for a corrupted filesystem so
I tried xfs_repair /dev/vda1 on both VMs to see if that would help. On one
the command succeeded and reported that the filesystem was fixed while on the
other it reported that there were still journal entries that needed to be
replayed or there might be additional corruption. I tried to mount the volume
again and got the same error so I had to resort to xfs_repair -L /dev/vda1
which zeroes the journal, discarding any entries that have not been replayed,
and then performs the repair anyway. That did manage to get the repair to
finish but I wasn’t sure if the files would be ok afterwards.
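The escalation described above (mount, then xfs_repair, then xfs_repair -L only as a last resort) could be sketched as a small shell function. This is not what I ran at the time, just an illustration: the function name repair_xfs and the ALLOW_ZERO_LOG opt-in variable are made up for this example, and the status messages are my own.

```shell
#!/bin/sh
# Sketch: escalate XFS recovery from least to most destructive.
# repair_xfs and ALLOW_ZERO_LOG are hypothetical names for this example;
# ALLOW_ZERO_LOG is not an xfs_repair feature.

repair_xfs() {
    dev=$1   # e.g. /dev/vda1
    mnt=$2   # e.g. /mnt/sysimage

    # Step 1: a plain mount replays the journal, which is the safest fix.
    if mount -t xfs "$dev" "$mnt"; then
        echo "mounted without repair"
        return 0
    fi

    # Step 2: ordinary repair, then try mounting again.
    if xfs_repair "$dev" && mount -t xfs "$dev" "$mnt"; then
        echo "repaired"
        return 0
    fi

    # Step 3: -L zeroes the log and can lose recent metadata changes,
    # so only do it when the caller has explicitly opted in.
    if [ "${ALLOW_ZERO_LOG:-0}" != 1 ]; then
        echo "refusing to zero the log"
        return 1
    fi
    if xfs_repair -L "$dev" && mount -t xfs "$dev" "$mnt"; then
        echo "repaired after zeroing log"
        return 0
    fi

    echo "still unmountable"
    return 1
}
```

From the rescue shell it would be called as `repair_xfs /dev/vda1 /mnt/sysimage`, and again with `ALLOW_ZERO_LOG=1` exported only if the first attempt refuses. Making the destructive step opt-in is a design choice for the sketch; taking a snapshot first, as I did, is still the real safety net.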
With that done, I decided I would try and patch up the initramfs while I was still in the rescue environment. For dracut to work properly inside a chroot, the repaired filesystem has to be mounted along with the proc, sysfs and dev pseudo-filesystems:
cd /
mount -t xfs /dev/vda1 /mnt/sysimage
mount -t proc proc /mnt/sysimage/proc
mount -t sysfs sys /mnt/sysimage/sys
mount -o bind /dev /mnt/sysimage/dev
chroot /mnt/sysimage
dracut --regenerate-all -f && grub2-mkconfig -o /boot/grub2/grub.cfg
exit
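For repeat use, the same sequence could be wrapped in a function that also unmounts the pseudo-filesystems afterwards, which the interactive session above skips. This is only a sketch: the name rebuild_boot_files is invented here, and it assumes the root filesystem is already mounted at the path passed in.

```shell
#!/bin/sh
# Sketch: rebuild the initramfs and GRUB config inside a chroot, then clean up.
# rebuild_boot_files is a hypothetical name; the root filesystem is assumed
# to be mounted already at the path given as the first argument.

rebuild_boot_files() {
    root=$1   # e.g. /mnt/sysimage

    # Mount the pseudo-filesystems the tools need, then run the same two
    # commands as above non-interactively inside the chroot.
    mount -t proc proc "$root/proc" &&
    mount -t sysfs sys "$root/sys" &&
    mount -o bind /dev "$root/dev" &&
    chroot "$root" /bin/sh -c \
        'dracut --regenerate-all -f && grub2-mkconfig -o /boot/grub2/grub.cfg'
    status=$?

    # Unmount in reverse order whether or not the rebuild succeeded.
    umount "$root/dev" "$root/sys" "$root/proc"
    return "$status"
}
```

Calling `rebuild_boot_files /mnt/sysimage` returns the exit status of the chroot step, so a failed dracut or grub2-mkconfig run is not hidden by the cleanup.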
The rebuild took a few minutes to finish but didn’t throw any errors so I decided to try
and boot the machine back up. A few minutes later and success! The machine booted
and I could log in without any issue. However, when poking around at system status
with systemctl status I saw two failed units on both machines. I ran
systemctl list-units --state=failed to see what was the matter and saw, tragically,
that it was dirsrv and ipa for my home domain. I tried to restart the IPA
service with systemctl restart ipa
but it failed with log messages about internal
database corruption. The only recommendation I could find was to try and reinitialize
the database from the one surviving server and hope for the best.
ipa-replica-manage re-initialize --from ipa.example.com
This initially failed because the server could not look up its own hostname in DNS.
I guessed that this was because the IPA servers were configured to use themselves as
DNS servers but since IPA wasn’t running there was no DNS service to reference.
Checking /etc/resolv.conf I discovered I was correct as the only nameserver listed
was 127.0.0.1. I modified the DNS nameservers to include the one remaining IPA
server and tried the command again. Again it failed but this time with an inability
to connect to LDAPS on the local host. It is a very strange design choice to make
the only way to recover from database corruption be to already have the service up
and running. It makes the process a catch-22 where you need a stable database on a
machine to recover from an unstable database on the same machine.
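The resolv.conf change mentioned above could be scripted along these lines. It is a rough sketch rather than the exact edit I made: prepend_nameserver is a made-up helper, and 192.0.2.10 in the usage note is a placeholder documentation address standing in for the surviving IPA server.

```shell
#!/bin/sh
# Sketch: put a working resolver ahead of the existing entries.
# prepend_nameserver is a hypothetical helper written for this example.

prepend_nameserver() {
    file=$1   # normally /etc/resolv.conf
    ip=$2     # address of the surviving IPA server

    # Nothing to do if the exact entry is already present.
    if grep -qxF "nameserver $ip" "$file"; then
        return 0
    fi

    # Build the new contents in a temp file, then write them back in place
    # so the original file keeps its ownership and permissions.
    tmp=$(mktemp) || return 1
    { echo "nameserver $ip"; cat "$file"; } >"$tmp" &&
    cat "$tmp" >"$file"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

Running `prepend_nameserver /etc/resolv.conf 192.0.2.10` puts the new entry first, which matters because the resolver tries nameservers in the order they are listed. On a default CentOS 8 install NetworkManager may regenerate the file later, so treat this as a temporary fix for the recovery session.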
The end result was that I had to revert all of the IPA machines to the disk snapshots I had taken six months ago. Thankfully there were minimal differences between the configuration then and the configuration now, so I should be able to replicate everything pretty easily. After reverting I ran updates on all three machines, shut them down, took another snapshot as a backup, and that was that.