Google finds multitude of memory errors

DRAM error rate higher than previously assumed

A study released this week by Google and the University of Toronto showed that data error rates on DRAM modules are vastly higher than previously thought and may be responsible for more system shutdowns and service interruptions than assumed.

The study, which used tens of thousands of Google's servers, showed that about 8.2% of all dual inline memory modules (DIMMs) are affected by correctable errors and that an average DIMM experiences about 3,700 correctable errors per year.

"Our first observation is that memory errors are not rare events. About a third of all machines in the fleet experience at least one memory error per year, and the average number of correctable errors per year is over 22,000," the report states.

"These numbers vary across platforms, with some platforms seeing nearly 50% of their machines affected by correctable errors, while in others only 12%-27% are affected."

The median number of errors per year on a Google server that had at least one error ranged from 25 to 611.

A memory error is marked by bits being read differently from how they were originally written. Memory errors can be caused by electrical or magnetic interference or by hardware corruption.

Memory errors are classified as soft errors, which randomly corrupt bits but leave no physical damage and can be corrected, and hard errors, in which a physical defect in DRAM cells causes the same bits to be corrupted repeatedly. Soft errors are often caused by radiation or alpha particles, which naturally occur in organic materials, including the epoxy that DRAM chips come packed in. Hard errors are most often caused by chip contamination at the manufacturing facility, but they often don't show up in testing and only surface after the memory chip warms after hours of use, according to Jim Handy, an analyst with Objective Analysis.

The Google/University of Toronto study included memory from multiple vendors as well as multiple types of DRAM (dynamic random access memory), such as DDR1, DDR2 and FB-DIMM.

The study covered the majority of servers in Google's data centres and was conducted over two and a half years, from January 2006 to June 2008.

While the study focused on servers and stated that error rates are not climbing with the latest, denser generations of DRAM, the results show that PCs will eventually need error correction code (ECC) technology as memory chips become more and more dense, Handy said.

ECC on special chips is used to detect and correct errors introduced during data storage or transmission.
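
As a rough illustration of how such a code can correct one flipped bit and flag two, here is a minimal sketch of a single-error-correcting, double-error-detecting (SECDED) Hamming code over 4 data bits. It is a toy, not the scheme used in the servers studied: ECC memory modules typically protect 64-bit words with 8 check bits, and the function names `encode` and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (a list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8                             # index 0 holds the overall parity bit
    # Data bits sit at Hamming positions 3, 5, 6, 7; parity bits at 1, 2, 4.
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]      # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]      # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]      # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2                # makes the total number of 1s even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    # XOR of the positions of all set bits is 0 for a valid codeword,
    # and equals the position of the flipped bit after a single error.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    odd_parity = sum(code) % 2                 # 1 if an odd number of bits flipped

    if syndrome == 0 and odd_parity == 0:
        status = "ok"
    elif odd_parity == 1:
        # Single-bit error: flip it back (syndrome 0 means the parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped, beyond correction.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

In this sketch, flipping any one bit of `encode(0b1011)` and passing the result to `decode` returns `("corrected", 11)`, while flipping any two bits returns `("uncorrectable", None)`, which corresponds to the case where a server has to give up on the data.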

Today, DRAM uses 50 nanometer lithography technology, but is migrating to 40 nanometer technology. The smaller the bits, the more susceptible they are to soft errors due to normal levels of radiation, Handy said.

For example, a server with error correction technology can continue to function after a soft error, whereas a PC would need to be rebooted. On a server, a hard error would likewise be corrected each time a processor attempted to read the affected bit. The DRAM in a PC has no error correction, so it would need to be replaced because the error would crash any system or application using that memory, Handy said.

"The study shows hard errors are more common than soft. That means modules are running and running and running in servers and every time a hard error bit is encountered, it's corrected so the memory module never gets replaced," Handy said. "If that happened to a PC user, the machine would stop working."

If an error is uncorrectable, as when the number of corrupted bits exceeds what the ECC can correct, a server will shut down.

"In many production environments, including ours, a single uncorrectable error is considered serious enough to replace the dual inline memory module that caused it," the Google report read.

Handy said such problems often result in system downtime and service outages.

The study states that memory errors are expensive in terms of the system failures they cause and the repair costs associated with them. They can also open the door to security problems.

"In production sites running large scale systems, memory component replacements rank near the top of component replacements and memory errors are one of the most common hardware problems to lead to machine crashes," the report stated. "Moreover, recent work shows that memory errors can cause security vulnerabilities."

