The last time my little repurposed desktop had an issue with its drives, it turned out to be a firmware issue with Samsung SSDs not working well under Linux. This time it was nothing that complex, just an SSD that had worn out after 2+ years of constant read/write traffic. The symptoms started with the server failing to boot right after GRUB tried to load the kernel, along with a repeating error in the boot logs about a DMA write failure for a specific disk sector. No matter what I tried, I couldn’t get the machine to start up. Time to pull out SystemRescueCD!

With that booted up, I started looking at the two drives in the system. My current configuration is a large Linux RAID 1 with LVM on top of it. However, I noticed that the drives weren’t exactly identical: there was a small ext4 partition that only existed on the older of the two drives. With a little more poking around I figured out that this was the /boot partition for GRUB on this machine, and that its filesystem was very, very unhappy.
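A side-by-side look at the two partition tables makes that kind of mismatch easy to spot; something like lsblk works well here (the column list is just one reasonable choice):

$ lsblk -o NAME,SIZE,FSTYPE,TYPE,MOUNTPOINT /dev/sda /dev/sdb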

I tried to fsck the partition, without success:

$ fsck -yv /dev/sda1

The error I got was that, while the superblock didn’t have the check flag set, there were records in the journal that could be replayed. However, when I tried replaying them I got another error saying that the operation had failed due to a write error on the superblock. Next I tried the backup superblocks, but got the same error for those as well.

$ sudo dumpe2fs /dev/sda1 | grep -i superblock
  Primary superblock at 0, Group descriptors at 1-38
  Backup superblock at 32768, Group descriptors at 32769-32806
  ...
$ e2fsck -yv -b 32768 /dev/sda1
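As an aside, had dumpe2fs not been able to read the disk at all, mke2fs has a dry-run mode that prints where the backup superblocks would have been placed without writing anything. It only reports the right locations if you pass the same options (block size in particular) that were used when the filesystem was created, so treat it as a hint:

$ mke2fs -n /dev/sda1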

More searching pretty much turned up that the drive was likely dying and that the filesystem probably wasn’t recoverable in place. So, just out of curiosity, I decided to check the SMART status of the drive to see what exactly had caused it to die.

$ smartctl -a /dev/sda

There it was! The available reserved space for remapping blocks was failing, with zero blocks remaining against a pre-failure threshold of 10. So it looks like the amount of work I had been throwing at this drive over the last few years finally caused it to give up the ghost. I’m a little curious why none of my monitoring picked up the SMART failure, but that will be a quest for another time.
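If you just want to keep an eye on that one attribute rather than reading the full report every time, filtering the attribute table is one option; the exact attribute name varies by vendor, so the pattern below is only a guess:

$ smartctl -A /dev/sda | grep -i -E 'reserv|rsvd'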

Now it was time to buy a new disk and try using ddrescue again to save the data. Since I wanted both the boot partition and the other half of my RAID 1, I installed the new drive alongside the failing one and ran ddrescue over the lot.

$ ddrescue -f -n /dev/sda /dev/sdb recovery.log
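The -n here tells ddrescue to skip the slow scraping of bad areas on the first pass; if that leaves unrecovered sectors behind, the same map file lets you retry just those areas afterwards, something along these lines (the retry count is arbitrary):

$ ddrescue -d -f -r3 /dev/sda /dev/sdb recovery.log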

About two hours later I had 8 bad blocks but an otherwise successful transfer, so I decided to try just fsck-ing the new disk’s boot partition and see what happened.

$ fsck -yv /dev/sdb1

Thankfully, that replayed the journal and cleaned up the filesystem for me without complaint. One more reboot later and I am back to happily chugging along with oVirt and running my various VMs. I will probably keep the old drive around for a few weeks to make sure everything is OK, but I think overall I am good to go!
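For anyone doing something similar, it’s worth a quick check that the RAID 1 and the LVM volumes on top of it came back cleanly before fully trusting the new disk. A minimal sketch, assuming an md array (/dev/md0 is a stand-in for whatever the array is actually called):

$ cat /proc/mdstat
$ mdadm --detail /dev/md0
$ lvs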