How costly is a bug caught in the field? 1,000 times costlier than finding it early

How costly is software failure? As blogger Chris Hobbs noted recently at electronicdesign.com, the cost of fixing a bug caught in the field is a thousand times greater than fixing the same bug during the requirements phase.

And, he writes, requirements are the source of 56 percent of all system defects. Those sobering facts are the launching pad for a discussion about how to improve the quality of software before its release, versus chasing bugs after release, as is so common today. 

“Failure rate runs high following the initial release because customers use the product in ways the developers didn’t anticipate and therefore didn’t verify,” Hobbs says, noting the familiar “bathtub curve” where failures are high at the time of release, drop through a period of stability and then start to rise again as the environment changes. Then comes a new release, and the bathtub cycle repeats. 
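To make that shape concrete, here is a minimal Python sketch of a bathtub-style failure rate using a toy additive model (an early-failure term that decays, a constant baseline and a wear-out term that grows). The constants are invented for illustration and are not Hobbs' data.

```python
# Illustrative only: a toy bathtub-shaped failure-rate model, not Hobbs' figures.
# Hazard = early-failure term (decreasing) + constant baseline + wear-out term (rising).
import math

def bathtub_hazard(t, early=0.5, decay=2.0, baseline=0.02, wear=0.0005):
    """Failure rate at time t (arbitrary units) under a simple additive model."""
    infant_mortality = early * math.exp(-decay * t)  # bugs that bite right after release
    wear_out = wear * t ** 2                         # the environment drifts over time
    return infant_mortality + baseline + wear_out

if __name__ == "__main__":
    for t in range(0, 21, 2):
        rate = bathtub_hazard(t)
        print(f"t={t:2d}  failure rate={rate:.3f}  {'#' * int(rate * 100)}")
```

Printed as a crude bar chart, the rate starts high, flattens out, then climbs again as wear-out dominates, which is the bathtub Hobbs describes.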

In effect, what many companies do is to use the field as a test bed, and their customers as guinea pigs. And this passes as an acceptable customer-oriented approach? Only with software.

Can you imagine a food manufacturer rolling a new variety of egg rolls or snack cakes into general distribution before scores of field tests? Or a restaurant changing its menu without using a test kitchen? Can you imagine an aerospace manufacturer hammering out glitches on a new design by taking up aircraft full of human beings?

No way. Such a thing only happens with software.

Hobbs explores the thinking that leads to this faulty approach (example: the conflation of “availability” and “reliability”) and how software developers can decrease the odds of failure in critical systems when pushing out upgrades. It’s a highly recommended read. 
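That availability/reliability distinction is worth a quick worked example. The Python sketch below uses made-up MTBF and MTTR figures and a simple exponential failure model to show how two systems can post essentially the same availability while one is far less reliable over a 24-hour mission.

```python
# Illustrative only: two systems with nearly identical availability but very
# different reliability, to show why the two terms shouldn't be conflated.
import math

def availability(mtbf, mttr):
    """Fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)

def reliability(mtbf, mission_hours):
    """Probability of zero failures over the mission, assuming an exponential model."""
    return math.exp(-mission_hours / mtbf)

# System A: fails every 10 hours but recovers in about 36 seconds.
# System B: fails every 1,000 hours and takes an hour to recover.
for name, mtbf, mttr in [("A", 10, 0.01), ("B", 1000, 1.0)]:
    print(f"System {name}: availability={availability(mtbf, mttr):.4f}, "
          f"P(no failure in 24 h)={reliability(mtbf, 24):.3f}")
```

With these toy numbers both systems are "up" about 99.9 percent of the time, yet the first will almost certainly fail at least once during a day-long mission. That gap is exactly what gets lost when the two terms are conflated.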

HealthCare.gov: What we can learn from a colossal failure 

The federal health care insurance website HealthCare.gov was a big flop when it debuted last year. While the site has stabilized and improved since, it remains the subject of examination for lessons on the development of large, complex systems. 

Over at InformationWeek, Anders Wallgren pulls apart the role of continuous delivery and diagnoses what hindered the success of HealthCare.gov. Among the causes he pinpoints: 

  • Too many control points: 55 contractors worked on the site, all operating under a tight timeline and communicating with multiple agencies. Teams need the ability to work independently while maintaining complete communications, he says.
  • Inadequate testing: This was the big issue with the site, and Wallgren isn’t the first to point it out. The site was virtually untested until just before its unveiling. Big, big mistake. Approaches like Service Virtualization make it possible to Always Be Testing (ABT), as sketched below. 
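As a rough illustration of the Always Be Testing idea, the Python sketch below stands up a fake backend service so client code can be exercised long before the real dependency exists. The endpoint, payload and function names are invented for this example; they are not taken from HealthCare.gov or any particular Service Virtualization tool.

```python
# Illustrative only: a tiny "virtualized" backend so client code can be tested
# continuously before the real dependency is built. Endpoint and payload are
# invented for this sketch.
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class FakeEligibilityService(BaseHTTPRequestHandler):
    """Stands in for a downstream service that isn't available yet."""
    def do_GET(self):
        body = json.dumps({"applicant_id": "12345", "eligible": True}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep test output quiet
        pass

def check_eligibility(base_url, applicant_id):
    """The client code under test: calls whatever URL it is configured with."""
    with urllib.request.urlopen(f"{base_url}/eligibility/{applicant_id}") as resp:
        return json.loads(resp.read())["eligible"]

if __name__ == "__main__":
    server = HTTPServer(("127.0.0.1", 0), FakeEligibilityService)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_address[1]}"
    assert check_eligibility(url, "12345") is True  # test runs with no real backend
    server.shutdown()
    print("eligibility check passed against the virtualized service")
```

The point isn’t this particular stub; it’s that once dependencies can be faked cheaply, testing can start on day one instead of days before launch.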

Check out Wallgren’s post for more. 

Speaking of outages, here’s a scary one 

Speaking of government and critical systems, FCC chairman Tom Wheeler recently addressed a software-related 911 system outage that prevented nearly 6,000 emergency calls in seven states from being answered during a six-hour period in April. The cause, of course, was described as a “software glitch.” 

Wheeler described the report as “terrifying.” 

“We’re pro-new technology, but we’re even more pro-public safety,” Wheeler said, according to urgentcomm.com. “The presentation makes it clear that the transition to all-IP networks is going to create a series of challenges such as this. An outage as the result of a hurricane or flooding is one thing.

“An outage as the result of bad code is completely something else. … This is a call to action.”

The outage in Washington state, half of Minnesota and parts of five other states was caused by an arbitrary cap on the number of 911 calls that could be handled by a central call center in Englewood, Colo., and by what blogger Donny Jackson called “questionable actions” by vendor Intrado.
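For a sense of how an arbitrary cap produces this kind of failure, here is a hypothetical Python sketch (not Intrado’s actual code) in which a hard-coded limit on total calls handled quietly stops dispatching once the counter crosses it.

```python
# Illustrative only: a toy router with an arbitrary hard-coded cap on total calls,
# showing how such a limit stops dispatch once the counter rolls past it.
# This is a hypothetical sketch, not Intrado's actual implementation.

CALL_CAP = 40_000_000          # an arbitrary lifetime limit baked into the code

class CallRouter:
    def __init__(self, calls_handled=0):
        self.calls_handled = calls_handled

    def route(self, caller_id):
        if self.calls_handled >= CALL_CAP:
            return None                     # call is dropped, no alarm raised
        self.calls_handled += 1
        return f"dispatched {caller_id}"

router = CallRouter(calls_handled=CALL_CAP - 1)
print(router.route("call-1"))   # the last call under the cap still goes through
print(router.route("call-2"))   # None: every later call is rejected
```

In this toy version the second call is simply refused; the real lesson is that a limit like this fails quietly rather than loudly, which is why Wheeler called the report a call to action.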