NeoPhi Breathing Again
On Tuesday the 15th, upon arriving at work, I got a monitoring alert email saying that my blog's RSS feed wasn't accessible over the Internet. Given some of the networking issues RCN has been having lately, I figured this was another one of them. This time that turned out not to be the case, though, as my housemate was showing up on GMail chat, which meant our home network was still online. I had him try rebooting the server, as it had wedged in the past due to kernel panics. Unfortunately, after a few attempts it just wouldn't start up. No video output, no nothing, just a repeating cycle of two short beeps every few seconds.
I was slated to head to a friend's house after work to check out his iRacing setup, but I figured I'd swing by home first to get a better idea of what troubleshooting lay ahead. It seemed, though, that I was just warming up for a bad day: a few blocks from my house my back tire went flat. Turned out I had gotten a carpet tack squarely wedged in it. No puncture-resistant tire can quite handle that. To top it off, my bicycle pump started freaking out while trying to fill up the tire. I still have no idea what was up with that. In any case, I did finally get my flat tire fixed but made no progress troubleshooting my system.
Since I wasn't sure how long my system would be down, I switched the email accounts that were being delivered directly to my server over to my Google Apps hosted domain and headed out for some iRacing fun. Man did I suck at that. I didn't realize quite how much the seat-of-the-pants feel of driving factors into car control.
After waking up Wednesday morning I noticed one of my typical daily morning emails wasn't there. I realized that with my server off the net since the previous morning, my secondary DNS server was no longer picking up SOA records, which of course meant there was no MX record for the domain, hence no email. I tried to promote my secondary DNS at easyDNS to become the primary but became completely frustrated with the UI and inconsistent help. I bailed on them and set up new primary DNS hosting with DNS Made Easy, which was a snap. A couple of hours later my whois record reflected the new DNS servers and mail started flowing again.
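If you ever suspect you're in the same spot, the quickest sanity check is to query the records directly. Here's a minimal sketch using the dnspython package (the domain is just a placeholder); if the SOA lookup fails or the MX answer comes back empty, mail delivery for the domain is in trouble.

```python
# Minimal sketch: confirm that a domain's SOA, NS, and MX records still
# resolve. Assumes the dnspython package (>= 2.0); the domain is a placeholder.
import dns.exception
import dns.resolver

DOMAIN = "example.com"  # placeholder; substitute the affected domain

def check_records(domain):
    resolver = dns.resolver.Resolver()
    for rtype in ("SOA", "NS", "MX"):
        try:
            answers = resolver.resolve(domain, rtype)
            for rdata in answers:
                print(f"{rtype}: {rdata}")
        except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer,
                dns.resolver.NoNameservers, dns.exception.Timeout) as err:
            print(f"{rtype}: lookup failed ({err.__class__.__name__})")

if __name__ == "__main__":
    check_records(DOMAIN)
```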
After work I bailed on what looked like an awesome BFPUG talk to spend more time troubleshooting. No dice. My motherboard manual indicated that a series of short repeating beeps meant a power issue. The power supply checked out fine and removing the power load by unplugging a few of the RAIDed drives didn't help. In the meantime I decided that even if I could get the server up it was time to raise the priority of virtualizing it. Adding fuel to the fire maybe wasn't the best thing to do but it was something I had wanted to do anyway.
Having used Amazon's EC2 environment extensively at my last job, I knew I could get a new server up and running quickly. Combined with RightScale, 30 minutes later I had an EBS volume attached to a generic RightScale server template and started working on getting my data transferred and services restored.
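RightScale handled the provisioning for me, but for reference, the EBS half of that step looks roughly like the sketch below through Amazon's current boto3 API (which post-dates this episode). The region, zone, size, device, and instance ID are all placeholders.

```python
# Sketch only: create an EBS volume and attach it to a running instance
# with boto3. Region, zone, size, device, and instance ID are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create a volume in the same availability zone as the target instance.
volume = ec2.create_volume(AvailabilityZone="us-east-1a", Size=100, VolumeType="gp3")
volume_id = volume["VolumeId"]

# Wait for the volume to become available, then attach it.
ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
ec2.attach_volume(VolumeId=volume_id, InstanceId="i-0123456789abcdef0", Device="/dev/sdf")
```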
On the data backup side I've used my MacBook Pro as an onsite backup for my server, with Time Machine and Jungle Disk backing up that data along with the rest of the files on my laptop. Worst case scenario, if the old server was completely dead and all the data was lost, I'd lose at most 9 days, that being the time since the last backup. While I'd verified that I could restore files back to my Mac, I had never tried using Jungle Disk to get them directly back onto a Linux server... I think everyone can see where this is going :) It should have worked, and the support ticket I opened with Jungle Disk indicated that it was possible, but despite trying for a couple of hours I couldn't get their graphical setup tool to work over X11 port forwarding (the version of Jungle Disk I'd used to make my backup wasn't Linux command line compatible). As a result I started the much longer process of using my home Internet connection to push files up to my new server, instead of pulling them directly onto the new server from S3, which is where I have my Jungle Disk backup.
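For a plain S3-hosted backup, the pull-straight-onto-the-server approach I wanted looks something like the sketch below. The bucket, prefix, and destination are placeholders, and since Jungle Disk stores its backups in its own format, this only illustrates the direction of the transfer, not an actual Jungle Disk restore.

```python
# Sketch of pulling a backup straight from S3 onto the new server instead of
# pushing it over a home connection. Bucket, prefix, and destination are
# placeholders; a real Jungle Disk restore still needs the Jungle Disk client.
import os
import boto3

BUCKET = "my-backup-bucket"    # placeholder
PREFIX = "server-backup/"      # placeholder
DEST = "/restore"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):          # skip "directory" placeholder keys
            continue
        target = os.path.join(DEST, key[len(PREFIX):])
        os.makedirs(os.path.dirname(target), exist_ok=True)
        s3.download_file(BUCKET, key, target)
```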
Thursday after work I made a run to MicroCenter to procure a new motherboard, as everything indicated that was the most likely failed component. Turns out my server is at least 3 generations behind the times, and MicroCenter had nothing that I could plug my existing components into. I looked at picking up both a new motherboard and CPU, but given that I was going virtual this felt like a waste of money. I left empty-handed, knowing that I was probably going to end up with a few days of lost data.
When I got home I checked that my transfer was still running and was disheartened to see that, based on how much rsync had already transferred, the expected completion time was 7 days. It was then I remembered that when I first ran Jungle Disk I had to leave my laptop on for a week straight for it to finish the first backup. While updating people on the status of the server at game night as we put a take-out order together, I had the idea that I could use a Windows instance running on EC2 to run Jungle Disk and restore my files, which would then make the copy onto my new server an Amazon-internal transfer. Post games, RightScale's Windows server template allowed me to get a server up and running in about 30 minutes. I had a few hiccups and unexplained operational states during that process, though. I'm going to attribute most of the oddness to the EBS issue Amazon was experiencing during that time. With Remote Desktop Connection I was able to control my new Windows instance (remember to update your security group to allow the RDP protocol in), install the Jungle Disk software, and start the restore. Estimated at 12 hours, this was much better than transferring from my local machine.
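The security group note is the one step that's easy to forget. In today's boto3 API the rule looks roughly like this sketch (group ID and source CIDR are placeholders; locking the rule to your own IP rather than 0.0.0.0/0 is the safer choice):

```python
# Sketch: open TCP 3389 (RDP) in the instance's security group so Remote
# Desktop can connect. The group ID and source CIDR are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 3389,
        "ToPort": 3389,
        "IpRanges": [{"CidrIp": "203.0.113.42/32"}],  # placeholder source IP
    }],
)
```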
The restore was done by the time I woke up on Friday, at which point I started the transfer to the new server. Just to review the path this data has taken: Linux to Mac to S3 to Windows to Linux. The last hop there pretty much meant any user/group/permission settings that existed on the original Linux server had been lost. Turns out some important information had also been lost on the original Linux to Mac step. By default the Mac filesystem is case preserving but case insensitive, so whereas on Linux two directories called Mail and mail are separate, only one of those wins when landing on the Mac. This unfortunately meant that getting at the original server's data was now a requirement; otherwise anything that had collided during the original Linux to Mac backup would be lost.
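The collision problem is easy to check for ahead of time. A small sketch like this walks a tree and flags paths that differ only by case, which is exactly what would silently merge when copied onto a case-insensitive (but case-preserving) filesystem:

```python
# Sketch: flag paths that differ only by case, which would silently merge
# when copied onto a case-insensitive (but case-preserving) filesystem.
import os
import sys
from collections import defaultdict

def find_case_collisions(root):
    seen = defaultdict(list)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            seen[path.lower()].append(path)
    return {key: paths for key, paths in seen.items() if len(paths) > 1}

if __name__ == "__main__":
    for paths in find_case_collisions(sys.argv[1]).values():
        print("collision:", ", ".join(paths))
```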
Post work on Friday I grabbed a much needed drink with friends at Green Street Grill, and when I got home I worked on getting the correct users and groups set up on my new system so that I could fix the user/group/permissions on the subset of files that had been restored. I went to bed with plans to trek out to MicroCenter on Saturday to buy a cheap desktop that I could use to pull the data off the old server.
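Since everything came through the Windows hop with its ownership stripped, the cleanup amounted to walking each restored tree and handing it back to its owner. Something along these lines, run as root, is the general idea; the directory-to-owner mapping below is a made-up example, not my actual layout.

```python
# Sketch, run as root: hand each restored tree back to its owner after the
# Windows hop stripped the user/group information. The mapping below is a
# made-up example, not the server's actual layout.
import os
import shutil

OWNERS = {
    "/home/alice": ("alice", "alice"),   # hypothetical user tree
    "/var/mail": ("mail", "mail"),       # hypothetical mail spool
}

for root, (user, group) in OWNERS.items():
    for dirpath, dirnames, filenames in os.walk(root):
        shutil.chown(dirpath, user=user, group=group)
        for name in filenames:
            shutil.chown(os.path.join(dirpath, name), user=user, group=group)
```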
I woke up Saturday morning long before MicroCenter would open, so I decided to take one last stab at fixing the problem. After ripping out the RAID card, which I really hoped wasn't the problem since it cost as much as the rest of the computer, the system didn't beep when I turned it on. Looking into the troubleshooting guide for the RAID controller, it turned out the beeping was the "bus reset signal" (I've not bothered to look up exactly what that means). Needless to say, with this new piece of diagnostic information I went back to searching the Internet and ran across a few posts about my motherboard not POSTing being related to bad memory. I unseated two of the DIMMs with no change in behavior. When I switched to having the other two unseated, the system started and the BIOS setup utility kicked in.
Bad memory all along. Which might explain some of the kernel panics I'd seen in the past. I plugged the RAID controller back in and everything still booted. Alas, when I went into the hardware RAID setup only 2 of the 5 drives showed up. Given that I run RAID 6 with a hot spare, those two drives were enough to ensure that I had no data loss. Looking in the case again, I noticed that in the course of mucking with the memory, and given the general tight space of the server, I'd knocked loose one drive's power, another drive's SATA cable, and lastly one drive's cable from the controller. Thankfully everything was hot-pluggable, and within a minute the RAID controller saw everything.
One very long reboot later (fscking is slow) my old server was up and running, albeit with half its memory gone. Having committed to making the virtualization switch, I started syncing the missing data up to the new server. Thankfully, with the bulk of the data already there, the deltas didn't take long. The rest of Saturday was spent getting MySQL and Apache running again and configured correctly to handle the fact that the root filesystem is ephemeral. For a machine like this I could technically use an EBS-backed root filesystem, but that would mean I couldn't quickly boot a new instance with a fresh and clean OS on it. Nor does it really speak to the real goal of virtualization, which is to be able to spin up new cloned instances to handle increased traffic, though I don't expect that to happen anytime soon with my server.
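The delta sync itself is the easy part once a first full copy exists. Something like the following sketch (the host name and paths are placeholders) only moves what changed, and since this copy goes Linux to Linux it keeps ownership and permissions intact, unlike the Windows detour.

```python
# Sketch of the final delta sync (host name and paths are placeholders).
# With the bulk of the data already present, rsync only moves what changed.
import subprocess

subprocess.run(
    [
        "rsync",
        "-az",                    # archive mode (owners, permissions, times) plus compression
        "--partial",              # keep partial files so an interrupted run can resume
        "--delete",               # remove destination files that no longer exist at the source
        "/srv/data/",             # placeholder source on the old server
        "newserver:/srv/data/",   # placeholder destination on the EC2 instance
    ],
    check=True,
)
```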
Needless to say, 5 days later, with a few late nights thrown in and having spent the previous weekend at the No Fluff Just Stuff conference, I'm ready for next weekend already.