Windows Azure outage caused by configuration error

August 2 outage was caused by a 'safety valve' mechanism meant to prevent cascading network failures, Microsoft says

A system configuration mistake caused the outage that affected Windows Azure customers in western Europe last week, according to Microsoft.

As a result, Microsoft's public cloud application hosting and development platform was unavailable for about two and a half hours on August 2. Microsoft didn't say how many customers were affected.

At issue was a "safety valve" mechanism in the Azure network infrastructure designed to prevent cascading network failures. It does so by capping the number of connections that network hardware devices accept.

"Prior to this incident, we added new capacity to the West Europe sub-region in response to increased demand. However, the limit in corresponding devices was not adjusted during the validation process to match this new capacity," said Mike Neil, Windows Azure general manager.

A sudden rise in the affected cluster's usage led to the "safety valve" threshold being exceeded, which generated a storm of network management alerts. "The increased management traffic in turn triggered bugs in some of the cluster's hardware devices, causing these to reach 100% CPU utilisation impacting data traffic," Neil said.

At the time, Microsoft solved the problem by increasing the affected cluster's "safety valve" limits. To prevent the situation from recurring, Microsoft is patching the identified bugs in the networking hardware devices, and it is also improving the network monitoring systems, so that they can identify and address connectivity issues before they cause outages.
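
The mismatch Neil describes, where capacity grew but the connection cap on the corresponding devices did not, is the kind of drift an automated check can flag. The sketch below is purely illustrative and is not Microsoft's tooling: the `DeviceConfig` fields, the `find_stale_limits` helper and the 20% headroom figure are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DeviceConfig:
    """Hypothetical record for one network device (not an Azure data model)."""
    name: str
    connection_limit: int           # the "safety valve" cap on accepted connections
    expected_peak_connections: int  # projected peak load after a capacity change


def find_stale_limits(devices: List[DeviceConfig], headroom: float = 1.2) -> List[str]:
    """Return the names of devices whose cap is below expected peak load plus headroom.

    The headroom factor is an assumed safety margin, chosen only for illustration.
    """
    return [
        d.name
        for d in devices
        if d.connection_limit < d.expected_peak_connections * headroom
    ]


# Example: capacity was added behind "edge-2", but its cap was never raised.
devices = [
    DeviceConfig("edge-1", connection_limit=1000, expected_peak_connections=500),
    DeviceConfig("edge-2", connection_limit=1000, expected_peak_connections=900),
]
stale = find_stale_limits(devices)
```

Running a check like this as part of the validation step for every capacity change would surface a cap that no longer matches the capacity behind it before traffic reaches the threshold.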

Forrester Research analyst James Staten said that PaaS (platform as a service) clouds such as Azure are very complex and highly automated environments, and sometimes glitches crop up in production that can't be anticipated in test environments. "This appears to be one of those cases," he said.

Over time, as new features, greater use and other factors enter the equation, administrators have to take steps to adjust and optimise the running system, and occasionally something will break, he said.

"Should it be something clients should be concerned about? Not really. It is an example of the kinds of things that can happen in a cloud environment. But far worse things are more common in a typical enterprise data center," Staten said.

IT chiefs and developers planning to host applications in the cloud need to configure them and design them to be fault tolerant. "That is a fundamental shift in thinking most developers and enterprise operations teams need to understand when embarking on cloud deployments," he said.
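
One common building block of the fault-tolerant design Staten refers to is retrying a failed call with exponential backoff and jitter, so that a brief platform outage does not immediately become an application failure. The following is a minimal sketch under stated assumptions: `call_with_retries` and its parameters are hypothetical names, and treating `ConnectionError` as the only retryable failure is a simplification.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base_delay=0.5,
                      max_delay=30.0, sleep=time.sleep):
    """Call operation(), retrying on ConnectionError with exponential backoff.

    The delay ceiling doubles on each attempt up to max_delay, and the actual
    wait is drawn at random below that ceiling (jitter) so that many clients
    do not all retry at the same moment. The last failure is re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))


# Example: a stand-in for a cloud call that fails twice and then succeeds.
attempts = {"count": 0}


def flaky_operation():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("service temporarily unavailable")
    return "ok"


# sleep is replaced with a no-op here so the example finishes instantly.
result = call_with_retries(flaky_operation, sleep=lambda seconds: None)
```

Retries only help with short interruptions. For an outage lasting hours, such as the one described here, an application would also need a fallback such as a deployment in a second region.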

"These types of outages are learning opportunities for both the cloud admins and cloud customers. Rather than view these incidents as indictments of cloud, they should be seen as opportunities to improve your use of the cloud," he added.


