Tripping on NeoPhi

As most regular readers of this blog know, my housemates and I are having the house de-leaded. This unfortunately means temporarily vacating part of the house while it is worked on. Thankfully, despite the age of Gilman Manor, it had very little lead. One area that did have lead was near where NeoPhi lives, which of course necessitated moving NeoPhi. That went smoothly: NeoPhi and its UPS moved into another room, and it should have been seen as only a short network blip. Alas, the short blip turned into a two-hour nightmare for me. Once NeoPhi was in its new home, while trying to get something I forgot I was going to need, I knocked its power cord out of the UPS.

Needless to say, it wasn't happy. The fsck during reboot was taking an extremely long time and I thought that it had died on me. When I was about to give up, it started beeping horribly, a sound I'd never heard before and hope to not hear again. In a panic I killed the power again and the beeping stopped. I turned it back on and during the RAID firmware startup the beeping started again. After another quick reboot I went into the RAID BIOS. It was in the middle of a rebuild, and the alarm was the indication that bad stuff was going on. No problem. While I could have let the system start up and run in degraded mode while rebuilding, I figured it was best under the circumstances to finish the rebuild in the BIOS.

That looked like it was going to take about an hour. Fine. I finished moving stuff around, read my email, and worked on other stuff to pass the hour. Upon returning, I found it stuck at 90.1% complete on the rebuild. I waited; still at 90.1%. I waited some more. Finally, after 10 minutes, it ticked up to 90.2%. This can't be good, I thought. I went into the event log and it was slowly filling up with Read Error on Channel 4. Great. Thankfully, since I'm running RAID 6, I wasn't too concerned about lost data. Maybe just some corrupt data due to the power loss.

Looking around the BIOS some more, though, I couldn't find anything about my hot spare drive. Instead I found 2 RAID sets: one was incomplete and the other had only 4 disks (one of which was my hot spare). Guess this is why they highly recommend buying the battery backup for your RAID controller. Given how long the read failures on drive 4 were taking, I took an unorthodox step and just yanked its cable from the controller. The event log quickly reported that the device had been detached, and in a couple of minutes the last 10% of the rebuild finished.

At this point I was able to boot up in single-user mode and run a manual fsck on all the partitions. I got some really nasty errors from fsck that I've never seen before. A little research said there wasn't much hope of really fixing them, so I just said Yes and let fsck do its thing. Once all disks passed fsck cleanly I rebooted again. I deleted the single drive that was part of the incomplete RAID set (which was really one of the primary drives of the original RAID set), added it as a hot spare, and rebooted again.

NeoPhi is now back up and running. While I type this the RAID is doing another rebuild in the background, and I have the one drive that was getting read errors sitting next to me, soon to be replaced with a new drive from Newegg. Unfortunately, along the way I again learned how limited RAID controller support is for OpenBSD, as I couldn't even use the alarm silence function of bioctl to shut the damn thing up. I must say, though, that the Areca support under OpenBSD at least exists, unlike support for my previous 3Ware card. Now I can at least get the status of the rebuild, even if some of the other features, like seeing the event log, don't seem possible.

I'll probably do a reboot when the new drive comes in, since I don't think hot plugging an internal drive is the best thing to do, even if I did unplug one that way. The thing that gets me the most about this little incident is that a friend of mine had serious problems with his VPS at Dreamhost. Seems like no matter what you do, eventually hardware or software failures will catch up with you.

Update: Seems the network port and/or cable that I plugged NeoPhi into last night degraded overnight, or possibly something got fried when the house lost power for 20 minutes this morning. In either case, slightly before the construction crew got fully set up I swapped out the cable, and that seems to have made the network happier.

Tags: gilmanmanor neophi

Comments

I actually learned something important from NeoPhi going down as well: a PHP script using cURL will hang for a very long time trying to contact a server that is not reachable. My personal blog aggregator scripts spent a whole lot of time spinning while NeoPhi was down :) Glad you got everything fixed, and good to see you back online again.