On June 29th 2012 a severe windstorm reffered to as a derecho tore through the Midwest and MidAtlantic regions
of the US. Over 1,750,000 homes and businesses were left without electricity. Datacenters supporting Amazon's AWS, Netflix and other large organizations were taken offline, and there were several deaths reported.
The story that follows offers some lessons relearned and possibly a few new ones.
I work for a company with a NOC and primary data-center in the path of the storm. A number of events took place. With day time temperatures near 108F and the windstorm coming through the battery on the backup generator powering the data-center cracked and was not able to start the generator. Notifications went out but due to hazardous road conditions no one was able to get onsite to address a clean shutdown of services. Remote access was offline as UPS batteries provided insufficient time. This is a known factor as we rely on the backup generator to operate. The generator is tested weekly, the test the day prior was ok, and the battery maintenance was performed on the same day as the test. But a generator that does not start when needed is no generator at all.
Power was restored only a few hours later which compounded the problem. The power came back before the first admin could safely get on site. When he did he found all systems powered on but none of the systems were reachable. The environment is highly virtualized, with a well-designed and thought out set of VM hosts and systems. The VM hosts are connected to redundant switches with redundant connections. However when power was restored the servers came online before the switches did. The VM hosts deactivated the NICs and prevented local communications. It looked at first glance like a NIC or Vswitch failure. A simple shut/noshut on the switch ports resolved this ultimately. Additionally services such as DHCP servers, AD servers, and RADIUS servers are all VMs. None of which were available. IP subnets were not documented in an emergency manual, nor were some key passwords for access to switches when RADIUS is down. They were all documented but not in an easy to locate emergency manual. A few phone calls resolved each of these situations with each taking additional time and delaying recovery.
The failure boiled down to the VM server not bringing up the NICs properly since they were up before the switches, this was then compounded by a sysadmin assuming the problem was a VM problem and beginning to reconfigure Vswitches (at the direction of a VMWare tech support technician). Once all parties were onsite resolution was fast and a complete recovery was obtained.
So on to old lessons learned – geographic redundancy is desirable, document everything in simple accessible procedures, some physical servers may be desirable, such as DHCP, and AD. Keys services such as RADIUS must be available from multiple locations. Securely documenting addresses and passwords in an offline reachable manner is essential as well as documenting system startup procedures.
Some new to me lessons learned are a little more esoteric. Complacency is a huge risk to an organization. Our company is undergoing a reorganization that is creating a lot of complacent and lackadaisical attitudes. It is hard to fight that. We are losing good people fast and hiring replacements very slowly. There is no technical solution to this problem. It puts a lot of pressure on individuals. I had not experienced a battery exploding in the past. Though I am finding that at least on this day it was a common event. I have learned of three or 4 similar events that same evening. Inter-team communication is a constant struggle. We all work well together but do not have a well-orchestrated effort to create and document our procedures across team boundaries. Lastly having a clearly identified roster of who to call for what problem when is a must and it must not be electronic. Much of our roster was not available and calls were made to people who may not have been the correct on call person for a group, and then personal relationships took over as the way to get things done. It worked, but is not an ideal scenario.
While I am not proud to be in the company of such giants as Amazon and Netflix I am glad that we restored service 100% in only a few hours and had no loss of data and business was not hurt by the event. I am sure I will identify more specific and achievable lessons from this event.
Please share your stories about this event, or lessons you learned in a recent event.