RAID to the Rescue
Thank G-d for RAID! And I’m not talking about the bug spray.
I have a Linux server in my basement that hosts the family email, a few websites, and all our music and photos. I built this machine more than five years ago. By modern standards it is woefully under-powered and under-sized. But it’s been running well all these years with nary a problem. Until recently…
It all started last week when, realizing that my installation of Ubuntu 10.04 LTS (Long Term Support) had reached its end of life, and that no new updates of any kind were coming, I threw caution (gently) to the wind and upgraded to the next LTS release, v12.04.
With the upgrade came a number of software issues that I was able to resolve through several rounds of rebuilding some sources, fixing configurations, and so on. One lingering problem concerned ntp. I won’t go into the details, as I am still not sure it’s fixed, but after many attempts at solving the issue through software changes, I decided the motherboard battery might be to blame and opted to replace it, which of course meant shutting down and powering off the system.
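For what it’s worth, a few standard checks help tell an ntpd problem apart from a flaky hardware clock; these aren’t a transcript of exactly what I ran, just the usual suspects:
[code]
% ntpq -p       # list the ntp peers and whether any are actually being used for sync
% hwclock -r    # read the hardware (CMOS) clock directly
% date          # compare against the system clock
[/code]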
The battery replacement went smoothly enough. The power-up and reboot did not. The system searched for a long time for connected devices, until it finally reported “Secondary master disk failure.” For the uninitiated, that is not good. But all was not lost!
Eventually, the system limped back to life, because the secondary master disk is part of a RAID1 mirror. So the system was still bootable, in a degraded state, using only the primary master disk.
Thinking that maybe I jarred a cable loose, I checked to make sure they were all seated and tried booting again. Same result. Ok, I thought, this is a RAID set. I logged in and checked the status of my RAID sets. Sure enough, the RAID set holding the operating system was degraded.
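For anyone following along, “checked the status” here just means the standard software-RAID views; something along these lines (md4 happens to be my OS mirror, as you’ll see below):
[code]
% cat /proc/mdstat            # quick overview of every md array; a degraded mirror shows up as [U_]
% mdadm --detail /dev/md4     # detailed state of a single array
[/code]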
While this entire fiasco was not what I had planned for this evening, I started thinking about what to do and when to do it. The RAID set in question is a RAID1 mirror with two 80GB disks. So, where was I going to get another 80GB disk? Turns out, I had one on the shelf that had been in another Linux server that I decommissioned years ago. Perfect.
I popped that disk in and rebooted. The system came up much more normally, and the RAID sub-system (software RAID) immediately identified both the degraded RAID set it was expecting and the replacement disk, which apparently still had several RAID partitions on it from its old system.
The next step was to get this new disk added to the degraded RAID set. Because it had some RAID partitions on it, the system had automatically created new RAID devices for them. These were useless to me, so I had to stop them before I could use the physical disk in my degraded RAID set:
[code]
% mdadm --stop /dev/md125
% mdadm --stop /dev/md126
% mdadm --stop /dev/md127
[/code]
Next I carefully located the device identifier assigned to the replacement disk: /dev/sdf.
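If you’re ever unsure which /dev/sdX name a freshly added disk was given, a couple of standard views make it obvious (assuming udev has populated /dev/disk/by-id, as it does on stock Ubuntu):
[code]
% lsblk                       # list every block device with its size and partitions
% ls -l /dev/disk/by-id/      # stable names including model/serial, symlinked to the sdX devices
[/code]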
I tried simply adding the device to the degraded RAID set, but it failed because the first partition of /dev/sdf was only 10GB. A little Googling turned up a nice trick. I clearly did not care about anything on the replacement disk, so obliterating its contents and making it look (from a partition point of view) just like the good disk in the RAID set made sense. This command did the trick:
[code]
% sfdisk -d /dev/sde | sfdisk /dev/sdf
[/code]
This dumps the partition table of /dev/sde onto /dev/sdf, essentially making the latter’s partition table (and apparent volume size) the same as the former.
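Before trusting the copy, it’s cheap to confirm the two partition tables now match:
[code]
% sfdisk -l /dev/sde          # partition table of the healthy disk
% sfdisk -l /dev/sdf          # should now match the healthy disk, partition for partition
[/code]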
Now the disk could be added to the degraded array:
[code]
% mdadm --add /dev/md4 /dev/sdf
mdadm: added /dev/sdf
[/code]
And, just as I expected, the next thing you know the RAID set is automatically rebuilding itself:
[code]
% mdadm --detail /dev/md4
/dev/md4:
        Version : 0.90
  Creation Time : Mon Oct 12 17:08:23 2009
     Raid Level : raid1
     Array Size : 76750400 (73.19 GiB 78.59 GB)
  Used Dev Size : 76750400 (73.19 GiB 78.59 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 4
    Persistence : Superblock is persistent

    Update Time : Thu Oct 29 20:49:31 2015
          State : clean, degraded, recovering
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

 Rebuild Status : 1% complete

           UUID : 98f274d3:f2707512:bf9fdff1:a347cbaf
         Events : 0.3452380

    Number   Major   Minor   RaidDevice State
       0       8       65        0      active sync        /dev/sde1
       2       8       80        1      spare rebuilding   /dev/sdf
[/code]
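The rebuild runs in the background, so there is nothing left to do but let it finish. Its progress can also be watched from /proc/mdstat, and if you’re impatient the kernel’s rebuild speed floor can be raised (the 50000 KB/s below is just an example value, not a recommendation for every system):
[code]
% watch cat /proc/mdstat                      # live view of the resync progress
% sysctl dev.raid.speed_limit_min             # current lower bound on rebuild speed (KB/s)
% sysctl -w dev.raid.speed_limit_min=50000    # example: let the rebuild use more I/O bandwidth
[/code]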
In the time it’s taken to write this article, the rebuild has completed and all is good again.
And in case you’re curious, this server isn’t limited to 80GB of RAID1 storage. No, no, no. The system includes a dedicated 4-port SATA controller with four 500GB drives attached, which operate together as a 1.5TB RAID5 array. Knock on wood, that one has operated, and continues to operate, without issue.
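For the curious, an array like that is built with the same mdadm tooling. This is only a sketch of how a four-disk RAID5 set might be created, with made-up device names rather than my actual configuration:
[code]
% mdadm --create /dev/md5 --level=5 --raid-devices=4 /dev/sd[a-d]1   # hypothetical disks; RAID5 spends one disk's worth of space on parity
% mkfs.ext4 /dev/md5                                                 # then format and mount like any other block device
[/code]
That one disk of parity is why four 500GB drives net out to roughly 1.5TB of usable space.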