A Real Life Disaster Recovery Event

Posted: Nov 1, 2023

I’ve participated in many Disaster Recovery (“DR”) planning events over the years, and I was generally left with the impression that, whether the exercise was successful[1] or not, the people running it never really came away with an understanding of what recovery would look like, or how complete the recovery would be. Fortunately, I only ever had to assist one company in recovering from a real disaster (more fortunately, I guess, it wasn’t my employer).

Back when I worked at the ISP, we sold wholesale services to a bunch of other providers, including one we’ll call Zeus Internet[2]. Zeus was a provider a few counties over that wanted to expand into our market, so we sold them co-location space, as well as wholesale T1 and DSL services for their business customers. One morning in 2003, Zeus’s headquarters, which housed almost all of their servers and network equipment, was struck by a tornado. Directly. The building was completely destroyed, and, they feared, so was their equipment. Their network certainly wasn’t working, and much of the equipment was covered in mud and water. It seemed like a worst-case scenario.

Fortunately for them, they had a partial disaster recovery plan, which consisted of an alternate location, equipment, and connectivity for the telephone services[3] they sold. This did not extend to any of their Internet services, which was the bulk of their business.

When they contacted us, we agreed to help them in any way we could, and suggested they bring whatever equipment they could salvage from their building to our site, and we would try to bring it back online. But we weren’t really sure what we were going to be able to do.

In our favor, we had:

The big unknowns were:

Fortunately, our concerns about the ILEC[4] were unfounded. They were great to work with, and once they understood the gravity of the situation, they acted quickly to remap circuits onto our SONET ring. Going from memory, they did this in less than 24 hours, which is basically at the speed of light as far as an ILEC is concerned. Through a combination of equipment that Zeus salvaged, some gear they already had in our colo facility, and some spares we had, we got all of their critical data circuits back up in the first 36 hours or so after the storm.

That left the various servers they had, almost all Linux based, to recover. This is where we spent most of our time (around three days in total to recover around 20 systems). And when I say recover, I mean recover. This is where the Zeus DR plan went from pretty good (telephone), to worked out in spite of not having a real plan (WAN/Internet), to a total miracle we recovered anything at all (the servers). Surprisingly, it was almost a complete success.

In the afternoon after the storm, one of the Zeus engineers rolled into our colo with a pickup truck full of servers caked in mud, and in some cases, still dripping water. They had no real backups to speak of, at least not that we ever saw. Our task was to try to recover whatever data we could and bring their email, DNS, billing, and web services back online. There were two Zeus engineers there to help us, but, honestly, they were already sleep-deprived and jittery on the first day, so another engineer and I did a lot of the work.

They brought a few undamaged computers with them (I think from their houses) and we had a few spares as well. We recovered the first two highest-priority servers (from memory, their primary DNS server and an email server) by disconnecting the hard drives, which didn’t appear to have any obvious damage, and connecting them to spare servers we had. After some fiddling with driver differences between what their kernels were compiled with and what our servers had, and undoing some of the hacks we had put in place to provide temporary service (like NATing their primary DNS server IP to one of our caching servers, to at least provide outbound resolution for their customers), we had those back online pretty quickly.
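If you’ve never seen that kind of NAT hack, here’s a minimal sketch of the idea, assuming a Linux box with iptables sitting in the traffic path and using made-up documentation addresses (192.0.2.53 for the dead DNS server, 198.51.100.10 for a working caching resolver). It’s only an illustration of the technique, not what we actually typed at the time:

    # Made-up example addresses: 192.0.2.53 is the dead DNS server's public IP,
    # 198.51.100.10 is a working caching resolver.
    iptables -t nat -A PREROUTING -d 192.0.2.53 -p udp --dport 53 -j DNAT --to-destination 198.51.100.10
    iptables -t nat -A PREROUTING -d 192.0.2.53 -p tcp --dport 53 -j DNAT --to-destination 198.51.100.10
    # If replies wouldn't naturally route back through this box, also masquerade the
    # forwarded queries so the resolver answers us instead of the original client.
    iptables -t nat -A POSTROUTING -d 198.51.100.10 -p udp --dport 53 -j MASQUERADE
    iptables -t nat -A POSTROUTING -d 198.51.100.10 -p tcp --dport 53 -j MASQUERADE

The nice thing about a hack like this is that undoing it is just a matter of deleting the rules once the real server is back on its original IP.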

From there, we spent the next few days trying to free up as many functioning servers as possible:

We almost always ran into some kind of problem, but we recovered from almost all of them. We tried to keep each hard drive with the original server it came from, because of the aforementioned driver problem, but sometimes we had to borrow components, either from our inventory, or from another of their systems, when we determined something was really broken.

In the end, we had one server (I remember it as one of their web servers) that was totally dead. No matter what we tried, we couldn’t pull data off the hard drive. There was another one we finagled into working, but the piezo buzzer on the motherboard was constantly on; it was annoying to be around, but that seemed like a small price to pay given the circumstances. Everything else, miraculously, was made to work.

They operated like this for several weeks, but were able to gradually move more and more of their stuff to the DR site they used for the backup telephone switch. Fortunately for them, unlike some of the horror stories you hear about this kind of thing, their business survived, and even thrived. They ended up building a very nice new data center, and, I believe, a substantially better DR plan.


  1. These events always had specific success criteria, usually to prove that one application or another could really be recovered, and sometimes just to make the event look like a success whether anything useful was proved or not. ↩︎

  2. A made-up name. ↩︎

  3. Think old-school analog landlines. ↩︎

  4. An Incumbent Local Exchange Carrier. A legacy local telephone company. These were created, largely, with the breakup of AT&T in 1984, but there were also a lot of independent ILECs that served small markets across the US, generally with a local monopoly. The ILEC in our story was a former Baby Bell. ↩︎

  5. As part of the AT&T divestiture, Local Exchange Carriers had the exclusive right to provide local telephone service, and Inter-Exchange Carriers (IXCs) provided long distance services, as well as the data equivalent. Basically, at the time, anything that crossed a local area boundary, called a Local Access and Transport Area or LATA, had to be carried by an IXC. This was true even if you bought a data circuit from a single ILEC that was delivered in two separate LATAs. You had to pay an IXC in the middle to bridge the two circuits together. ↩︎
