Rescuing a Failed SSD
The last time my little repurposed desktop had an issue with its drives, it turned out to be a firmware issue with Samsung SSDs not working well in Linux. This time, nothing that complex: just an SSD that had worn out after 2+ years of constant read/write traffic. The symptoms started with the server failing to boot right after GRUB tried to load the kernel, along with a repeating error in the boot logs about a DMA write failure on a specific disk sector. No matter how long I tried, I couldn’t get the machine to start up. Time to pull out SystemRescueCD!
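If you need to make a fresh rescue stick first, writing the ISO straight to a USB drive with dd is one way to go; a rough sketch, where the ISO name and the /dev/sdX target are placeholders for whatever you actually downloaded and plugged in:
$ dd if=systemrescue-amd64.iso of=/dev/sdX bs=4M status=progress conv=fsync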
With that booted up, I started looking at the two drives in the system. My current configuration is a large Linux RAID 1 with LVM on top of it. However, I noticed that the drives weren’t exactly identical: there was a small ext4 partition that only existed on the older of the two drives.
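A quick way to spot that kind of mismatch (assuming the drives show up as /dev/sda and /dev/sdb, as they do later in this post) is to compare the partition layouts and check the RAID state:
$ lsblk -f /dev/sda /dev/sdb
$ cat /proc/mdstat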
With a little more poking around I figured out that this was the /boot partition for GRUB on this machine, and that its filesystem was very, very unhappy. I tried to fsck the drive, without success:
$ fsck -yv /dev/sda1
The error I got was that while the superblock’s recovery flag wasn’t set, there were still records in the journal that could be replayed. However, when fsck tried to replay them, it failed with a write error on the superblock. Next I went looking for backup superblocks to try, but got the same error for those as well.
$ sudo dumpe2fs /dev/sda1 | grep -i superblock
Primary superblock at 0, Group descriptors at 1-38
Backup superblock at 32768, Group descriptors at 32769-32806
...
$ e2fsck -yv -b 32768 /dev/sda1
Further searching pretty much turned up that the drive was likely dying and that the filesystem likely wasn’t recoverable in place. So, just out of curiosity, I decided to check the SMART status of the drive to see what exactly had caused it to die.
$ smartctl -a /dev/sda
There it was! The available reserved space for remapping blocks was failing, with zero blocks remaining against a pre-failure threshold of 10. So it looks like the amount of work I had been throwing at this drive over the last few years finally caused it to give up the ghost. I’m a little curious why none of my monitoring picked up the SMART failure, but that would be a quest for another time. Now it was time to buy a new disk and try using ddrescue again to save the data.
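As for the monitoring gap, the usual answer is to have smartd watch the drives and nag you; a minimal /etc/smartd.conf line along these lines (a sketch, with the mail target as a placeholder) would check all attributes and keep emailing while a failure persists:
DEVICESCAN -a -m root -M daily
And a one-off check of the drive’s overall verdict is just:
$ smartctl -H /dev/sda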
Since I wanted both the boot partition and the other half of my RAID 1, I installed the new drive alongside the failing one and ran ddrescue over the lot.
$ ddrescue -f -n /dev/sda /dev/sdb recovery.log
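The -n flag skips ddrescue’s slow scraping phase, so if the first pass leaves bad areas behind, the same mapfile can be reused for a more aggressive retry; something like this, again only a sketch:
$ ddrescue -d -f -r3 /dev/sda /dev/sdb recovery.log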
About two hours later I had 8 bad blocks but an otherwise successful transfer, so I decided to try just fsck-ing the new disk’s boot partition and see what happened.
$ fsck -yv /dev/sdb1
Thankfully, that happily replayed the journal and cleaned up the filesystem for me. One more reboot later and I am back to happily chugging along with oVirt and running my various VMs. I will probably keep the old drive for a few weeks to make sure that everything is ok, but I think overall I am good to go!
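Before the old drive actually goes in a drawer, it’s worth double-checking that the array came back healthy after the reboot; something along these lines, where /dev/md0 stands in for whatever the RAID 1 device is actually named:
$ cat /proc/mdstat
$ mdadm --detail /dev/md0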