
It ain't smart to rely on SMART

Disk drive diagnostics not good enough.

Google research has shown that built-in disk drive diagnostics only predict about half the drive failures that occur.

Modern disk drives have a built-in self-test and diagnostic facility called Self-Monitoring, Analysis and Reporting Technology (SMART). The drive firmware monitors a range of drive parameters, such as the number of seek errors and the disk spin-up time. If these parameters degrade over time, it may indicate that the unit is heading for a breakdown. With advance warning of an impending disk failure, you have a chance to move files or replace the unit before you lose any data.
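As a rough illustration of what 'degrading over time' could mean in practice, the Python sketch below flags any monitored parameter whose reading has grown by more than a set amount since the first sample. The parameter names, the limits and the sample readings are all invented for illustration; real drives expose vendor-specific attributes and thresholds.

```python
from typing import Dict, List

# Hypothetical limits: how far a reading may rise before it is flagged.
MAX_INCREASE = {"seek_errors": 50, "spin_up_ms": 500}


def degrading_parameters(history: Dict[str, List[float]]) -> List[str]:
    """Return the names of parameters that have worsened past their limit.

    `history` maps a parameter name to its readings in chronological order,
    where a higher reading is assumed to be worse.
    """
    flagged = []
    for name, readings in history.items():
        limit = MAX_INCREASE.get(name)
        if limit is None or len(readings) < 2:
            continue  # no limit defined, or not enough samples to compare
        if readings[-1] - readings[0] >= limit:
            flagged.append(name)
    return sorted(flagged)


# Made-up readings for a single drive, oldest first.
history = {
    "seek_errors": [3, 4, 4, 6],
    "spin_up_ms": [4100, 4150, 4300, 4700],
}
print(degrading_parameters(history))
```

Comparing only the first and last samples is the simplest possible trend test; a production monitor would more likely smooth the readings or fit a slope.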

Google's study looked at more than one hundred thousand consumer-grade serial and parallel ATA hard disk drives, ranging in speed from 5400 to 7200 rpm and in capacity from 80 to 400 GB. Observed annualised failure rates ranged from 1.7 percent for drives in their first year of operation to over 8.6 percent for drives in their third year.
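For readers unfamiliar with the term, an annualised failure rate is usually worked out as failures divided by drive-years of service. The Python sketch below assumes that common definition (the article does not spell out the study's exact method) and uses an invented fleet, not Google's data.

```python
def annualised_failure_rate(failures: int, drive_days: float) -> float:
    """Return failures per drive-year of service, as a percentage."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365.0
    return 100.0 * failures / drive_years


# Hypothetical fleet: 1,000 drives, each in service for a full year,
# with 17 failures over that period.
afr = annualised_failure_rate(failures=17, drive_days=1000 * 365)
print(f"{afr:.1f}%")
```

Counting drive-days rather than drives means units that were installed or retired part-way through the year are weighted by the time they actually spent in service.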

The Google researchers found that SMART diagnostics are not as useful as they are supposed to be. They note that there is little independent research into drive life and diagnostics, stating 'Most of the available information comes from the disk manufacturers themselves. Their data are typically based on extrapolation from accelerated life test data of small populations or from returned unit databases.'

They note 'detailed studies of very large populations (of hard drives) are the only way to collect enough failure statistics to enable meaningful conclusions. In this paper we present one such study by examining the population of hard drives under deployment within Google’s computing infrastructure.' Google has 'built an infrastructure that collects vital information about all Google’s systems every few minutes, and a repository that stores these data in time-series format (essentially forever) for further analysis.'

The researchers mined this data, looking for correlations between failure events and hard drive sensor and SMART readings. Their findings were:

- Very little correlation between failure rates and either raised temperature or activity levels.
- Some SMART parameters (scan errors, reallocation counts, offline reallocation counts, and probational counts) have a large impact on failure probability. Others do not. Of all failed drives, over 56 percent had no count in any of these four strong SMART signals.
- There was a lack of failure-predicting SMART signals on a large proportion of failed drives.
- Taking all SMART signals and temperature readings into account, they found that about 36 percent of all failed drives had no predictive failure signals at all.
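To make the 'no count in any of the four strong signals' figure concrete, here is a minimal Python sketch of how such a fraction could be computed from per-drive records. The `DriveRecord` structure, its field names and the sample fleet are assumptions made up for illustration; they are not taken from the study.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class DriveRecord:
    """Hypothetical per-drive summary of the four strong SMART counters."""
    scan_errors: int
    reallocations: int
    offline_reallocations: int
    probational: int
    failed: bool


def has_strong_signal(drive: DriveRecord) -> bool:
    """True if any of the four strong counters is non-zero."""
    counts = (
        drive.scan_errors,
        drive.reallocations,
        drive.offline_reallocations,
        drive.probational,
    )
    return any(count > 0 for count in counts)


def unwarned_failure_fraction(drives: Iterable[DriveRecord]) -> float:
    """Fraction of failed drives that showed none of the four signals."""
    failed = [d for d in drives if d.failed]
    if not failed:
        raise ValueError("no failed drives in the sample")
    silent = sum(1 for d in failed if not has_strong_signal(d))
    return silent / len(failed)


# Made-up sample: four failed drives (two with no warning), two healthy.
drives = [
    DriveRecord(0, 0, 0, 0, failed=True),
    DriveRecord(2, 0, 0, 0, failed=True),
    DriveRecord(0, 0, 0, 0, failed=True),
    DriveRecord(0, 5, 1, 0, failed=True),
    DriveRecord(0, 0, 0, 0, failed=False),
    DriveRecord(0, 1, 0, 0, failed=False),
]
print(f"{unwarned_failure_fraction(drives):.0%} of failures had no strong signal")
```

The last healthy drive in the sample shows the other side of the problem: a drive can raise a signal without having failed, so a non-zero counter is a warning rather than a verdict.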

Their conclusion was that 'it is unlikely that an accurate predictive failure model can be built based on these signals alone.' Further, 'models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures.'

Google's researchers hope that predictive models that 'use parameters beyond those provided by SMART could achieve significantly better accuracies. For example, performance anomalies and other application or operating system signals could be useful in conjunction with SMART data to create more powerful models.'

Google uses millions of drives, so its findings should be taken seriously by the hard drive industry. Customers implementing disk-to-disk (D2D) backup systems should take note too: they need better disk failure protection built into those systems, meaning stronger RAID schemes, such as RAID 6 or RAID-DP, and more spare drives.





