Atlanta Data Center RFO (February 24)
Valued Cyber Wurx Customer,
Please see below for the RFO related to the utility power outage on February 24th, 2017.
RFO (Reason for Outage) Report
Event Description: Utility Power outage on 800A 16.E1
Event Start Time: February 24th, 2017 11:30 am EST
End Time: February 24th, 2017 11:45 am EST
At roughly 11:30 am EST on February 24th, 2017, the Cyber Wurx landlord allowed a vendor to perform maintenance on a generator that feeds the Cyber Wurx facility. This maintenance activity was not approved by Cyber Wurx, and we received no notification of the maintenance window.
During this maintenance, the vendor's technician caused the generator to malfunction, which caused the utility side of power to drop. The technician had set the generator to manual mode for the maintenance, which prevented it from sensing the utility power outage.
The 16.E1.1 UPS held facility power for roughly 9 minutes, at which point it began running out of battery power. Because the emergency generator was not actively online, this initiated a rolling power outage in one segment of the facility. About 1 to 2 minutes later, the building vendor realized what had happened and restored utility power.[...]
Future Preventive Action
Cyber Wurx is working diligently to address the utility power issue with the landlord and will work to ensure this type of activity never happens in the future. This includes improved alerting of utility power outages and improved notifications from the landlord of vendor activity. We are also pursuing engineering feedback to enhance the switching between utility power and emergency power.
For RamNode customers specifically, the event included hard reboots on servers in several of our cabinets and multiple router reloads to fix a configuration bug encountered after the power loss. We ended up needing to manually re-add all our IP subnets to restore IPv6 connectivity, which caused several brief periods of packet loss after the actual power outage. Several of the rebooted servers required an fsck to restore service, one booted into the wrong kernel, and five others needed full RAID card replacements. We were able to easily replace the cards in ATLSVZ9, ATLCVZ4, and ATLCVZ5, but ATLSKVM3 and ATLSKVM4 had legacy configurations that would not work with our remaining spare cards. For those two, we had to move the SSDs to a spare Standard E5 server and then migrate customer data from there to newer spare Premium E3 servers running our current configuration. Fortunately, only small amounts of cached data were lost in the process, and all servers successfully recovered from the unexpected outage. Although this event should never happen again, we will continue evaluating what we could have done better and how we can better prepare for future outages.
Please make sure you follow/bookmark https://twitter.com/NodeStatus for network news and updates.
Monday, February 27, 2017