Product Planning Strategy

By Courtney Macavinta

No matter the industry or size of an organization, CIOs have a universal mandate: Prevent unplanned downtime. Network component failure can be a CIO's nemesis when it comes to meeting this requirement. And yet there is a simple -- though often misunderstood -- instrument that can help CIOs predict the reliability of IT systems when used correctly. The metric is Mean Time Between Failures (MTBF), and it estimates the average time a component will work without failure.

Often MTBF rates are used to market products. Vendors will boast of fault-tolerant products with up to 99 percent MTBFs or use the metric to compare their offerings directly with the competition's. Yet analysts say that, taken alone, MTBF figures are not always what they seem.

"You can get a 'good MTBF' by spending a fortune and buying an incredibly expensive piece of equipment. But MTBF all by itself is just a metric. You need to couple it with some other things and put it into context to make it meaningful," says George Spafford, an IT process consultant and coauthor of The Visible Ops Handbook, a process-improvement guide for IT organizations.

Spafford and others say that with the increasing complexity of systems, CIOs shouldn't rely solely on the MTBF rate of each component in their network to prevent unplanned downtime. Instead, MTBF should be used with the following best practices in mind:

Put MTBF into context

In reality, MTBF alone can't predict a product's lifespan because the failure rate of any product will get worse as the product begins to wear out, according to Forrester Research analyst Galen Schreck in his September 2005 report, Mean Time Between Failures: The Basics. Schreck says there are several myths surrounding MTBF, including that the MTBF of the "weakest link" reflects the reliability of the whole system.

"In fact, overall reliability will be even lower than the MTBF of the weakest link because we must account for the added possibility that a separate failure could bring down the system," he writes. "Many newer devices, such as high-performance SCSI drives, advertise MTBF figures around 1 million hours, but obviously no one has tested any device over that length of time."
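Schreck's point about the weakest link can be illustrated with a little arithmetic. The sketch below assumes, for simplicity, that components fail independently at constant rates and that any single failure brings the system down (a series configuration). The component names and MTBF figures are hypothetical, not vendor data.

```python
def series_mtbf(component_mtbfs_hours):
    """MTBF of a system that fails when any one component fails.

    Assumes independent components with constant failure rates, so the
    individual failure rates (1 / MTBF) simply add up.
    """
    total_failure_rate = sum(1.0 / mtbf for mtbf in component_mtbfs_hours)
    return 1.0 / total_failure_rate


# Hypothetical component MTBFs in hours (illustrative values only).
components = {
    "disk": 1_000_000,
    "power_supply": 200_000,
    "switch": 250_000,
}

system_mtbf = series_mtbf(components.values())
weakest_link = min(components.values())

print(f"Weakest link MTBF: {weakest_link:,.0f} hours")
print(f"System MTBF:       {system_mtbf:,.0f} hours")
```

With these made-up numbers, the system MTBF works out to roughly 100,000 hours, about half the figure for the weakest component.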

Focus on your service-level agreements

The one area where analysts agree MTBF is most useful is in product-cycle management. MTBF can help CIOs keep track of the longevity of systems, offering a strong clue about when to replace them. Most companies, however, replace equipment before it's even close to being obsolete. So rather than using MTBF rates to predict product lifespan, Spafford says, the metric is more useful for limiting downtime. That said, it's important not to rely solely on vendors' rates but instead to develop in-house metrics based on existing service-level agreements (SLAs).

"Some companies can't afford any unplanned downtime," he says. "You've got to do real-world testing to come up with the appropriate MTBF benchmark. Then write your contracts with consequences based on your expectations [versus just using the vendors' MTBF rate]."
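As a rough illustration of an in-house benchmark, observed MTBF can be estimated by dividing total operating hours by the number of failures recorded. The sketch below is one simple way to do that, not a method prescribed by Spafford. The fleet size, failure count, and vendor figure are hypothetical values chosen for illustration.

```python
def observed_mtbf(total_operating_hours, failure_count):
    """Estimate MTBF from field data: operating hours per observed failure."""
    if failure_count == 0:
        return None  # no failures observed, so no estimate is possible yet
    return total_operating_hours / failure_count


# Hypothetical field data: 50 identical units, each run for one year.
units = 50
hours_per_unit = 8760  # hours in a non-leap year
failures = 4

in_house = observed_mtbf(units * hours_per_unit, failures)
vendor_claim = 1_000_000  # hypothetical vendor-advertised MTBF, in hours

print(f"In-house MTBF: {in_house:,.0f} hours")
print(f"Vendor claim:  {vendor_claim:,.0f} hours")
```

A gap between the two figures, like the one in this invented example, is the kind of evidence that can inform contract terms.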

Schreck writes that organizations might also be better served by computing survival rates, not just MTBF, to analyze a device's "useful life" (using, for instance, a formula in the National Institute of Standards and Technology's Handbook of Statistical Methods). "Failure rates follow a bathtub [shaped] curve," he writes. "To get a handle on the real failure risk, firms should focus their analysis on the middle phase that encompasses the useful life of a device."
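One common way to compute a survival rate during the flat, middle phase of the bathtub curve is the exponential model, which assumes a constant failure rate. This is offered as a sketch of the idea rather than the specific formula Schreck has in mind. The MTBF value and time horizons are assumptions for illustration.

```python
import math

HOURS_PER_YEAR = 8760


def survival_probability(hours, mtbf_hours):
    """Probability a device is still working after `hours` of operation.

    Exponential model: valid only while the failure rate is roughly
    constant, i.e. during the device's useful life.
    """
    return math.exp(-hours / mtbf_hours)


mtbf = 1_000_000  # assumed MTBF in hours

five_year = survival_probability(5 * HOURS_PER_YEAR, mtbf)
at_mtbf = survival_probability(mtbf, mtbf)

print(f"Chance of surviving 5 years:        {five_year:.1%}")
print(f"Chance of surviving to MTBF itself: {at_mtbf:.1%}")
```

Under this model only about 37 percent of devices are expected to last as long as the MTBF figure, which is one reason the number should not be read as a lifespan.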

Combine it with other metrics

While an imprecise predictor of failure rates, MTBF is still a useful tool. Spafford recommends using MTBF in conjunction with other key metrics to understand the causes of any unplanned downtime. For example, also monitor the number of planned or emergency changes made to systems; change success rates; and environmental conditions in the data center, such as temperature and voltage levels.

"The majority of problems in IT are still caused by human error," he says. "MTBF is affected by your people, process, and technology. So if you have a high rate of change and you're having issues with availability, you need to look at your change process."

Along those lines, it's equally important to establish your Mean Time to Repair (MTTR), he adds, which requires investing not just in hardware but in the training and processes necessary to prevent downtime and rapidly recover after failures.
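The interplay between the two metrics is often summarized with the standard steady-state availability formula, MTBF / (MTBF + MTTR). The sketch below uses that formula with hypothetical MTBF and repair times to show how much repair speed alone can change expected downtime.

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_downtime_hours_per_year(mtbf_hours, mttr_hours):
    """Average unplanned downtime per year implied by MTBF and MTTR."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


mtbf = 100_000  # assumed system MTBF in hours

# Same hardware, two hypothetical repair processes.
fast_repair = expected_downtime_hours_per_year(mtbf, mttr_hours=4)
slow_repair = expected_downtime_hours_per_year(mtbf, mttr_hours=48)

print(f"Downtime with 4-hour MTTR:  {fast_repair:.2f} hours/year")
print(f"Downtime with 48-hour MTTR: {slow_repair:.2f} hours/year")
```

In this example the hardware is identical in both cases. The difference between roughly 21 minutes and a little over four hours of downtime a year comes entirely from how quickly people and processes restore service.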

"The technology and MTBF rate really doesn't matter if you have the wrong people and the wrong process in place," Spafford says. "In the end, it's about meeting your SLA and providing value."

Courtney Macavinta is a Silicon Valley-based business and technology writer. Her articles have appeared in CNET News, Business 2.0, Red Herring, and The Washington Post.

 

CIO Strategy Center is a daily editorial resource offering innovative insights and strategies for building an integrated, secure and resilient IT infrastructure.
