VMware Engineer Slip Caused Cloud Foundry Outage
VMware has said that human error caused a second outage of its Cloud Foundry platform-as-a-service
VMware has blamed human error for an outage that affected its Cloud Foundry platform-as-a-service offering on 26 April, saying that the outage was the result of a mistake in engineers’ preparations to avoid future service interruptions.
The 26 April outage followed a previous interruption on 25 April, which resulted from a partial outage of a power supply in a storage cabinet, according to VMware official Dekel Tankel.
The outages followed highly publicised problems with Amazon Web Services (AWS) that lasted several days and affected a number of websites. VMware’s offering is much newer than AWS’s EC2 cloud hosting service, having been announced on 12 April, and is still at the beta-testing stage.
The first Cloud Foundry outage occurred at 5:45am and lasted through to the afternoon, according to Tankel.
“Existing applications were not impacted by this event and continued to operate normally,” he wrote. “The folks most impacted by this event were the developers who received their access credentials the night before. They could not log in until 3:30pm when the system health and storage connectivity was fully restored to 100 percent availability.”
While the outage is not a “normal event”, Tankel said it is “something that can and will happen from time to time”.
“In this case, our software, our monitoring systems, and our operational practices were not in synch… the net result is that the Cloud Controller declared a loss of connectivity to a piece of storage that it needs in order to process many control operations,” Tankel continued. “Once the system had entered this state, it took us several hours to validate that we had no loss of data and that the storage cabinet was operating correctly and at full reliability and redundancy.”
As a result of the first outage, VMware decided to develop a procedure for detecting, preventing and recovering from such events in the future – an “operational playbook”.
“This was to be a paper only, hands off the keyboards exercise until the playbook was reviewed,” Tankel said. “Unfortunately, at 10:15am PDT, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry. This took out all load balancers, routers, and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry.”
Tankel said that during this second outage, all applications and system components continued to run.
“However, with the front-end network down, we were the only ones that knew that the system was up,” Tankel wrote. The front-end system was fully restored by 11:30am, he wrote.
The AWS problems began on 21 April at roughly 9.40am BST, when faults at an AWS data centre in Northern Virginia disrupted its EC2 cloud hosting service, which in turn was said to have knocked thousands of websites offline.
These included the social news website Reddit; the Twitter toolbox Hootsuite; the Q&A website Quora; and the location-based social networking website Foursquare.
Whilst the problems were said to have persisted for some websites for a number of days, others came back online relatively quickly.