Outage caused by single admin mortifies cloud provider Joyent

Joyent is looking at how to improve software and operational procedures to prevent a recurrence

Cloud provider Joyent suffered an outage on Tuesday after an administrator was able to simultaneously reboot all virtual servers hosted in the company's US-East-1 data center.

"It should go without saying that we're mortified by this. While the immediate cause was operator error, there are broader systemic issues that allowed a fat finger to take down a datacenter," said Bryan Cantrill, CTO at Joyent, in a post on Hacker News.

The company first noticed something had gone wrong when it started seeing transient availability issues.

"Due to an operator error, all compute nodes in US-East-1 were simultaneously rebooted. Some compute nodes are already back up, but due to very high load on the control plane, this is taking some time," Joyent said in an initial update on the issue.

About an hour after first reporting the problem, the company said that all compute nodes and virtual machines were back online.

Joyent didn't say how many customers or servers were affected by the reboot. However, an error of this magnitude shouldn't be allowed to happen, and it highlights the importance of processes that balance the need for effective management with the need to protect users against these kinds of issues.

"As soon as we reasonably can, we will be providing a full postmortem of this: how this was architecturally possible, what exactly happened, how the system recovered, and what improvements we are and will be making," Cantrill wrote.

The company is looking at how it can improve software and operational procedures to ensure that this doesn't happen in the future, and also how the recovery after a failure can be made smoother, according to Cantrill.

Just like any IT system, cloud-based services and servers can suffer from outages, but because of the large number of users, the consequences are usually larger.

This week some Amazon Web Services users were hit by a power outage. Servers in one of the US-West-1 region's availability zones were affected, and it took almost three hours for Amazon to recover all instances. Amazon didn't elaborate on what caused the power failure.

Recently, Twitter also suffered an outage after a change to one of its core services went wrong. HBO angered users of its Go service twice, when it was overwhelmed by the number of people who wanted to watch the season premiere of "Game of Thrones" and the finale of "True Detective."

Send news tips and comments to mikael_ricknas@idg.com


