Grizzaffi: Dealing with ‘Failure Fatigue’ Is a Matter of Automating Test Cycles

I don’t know whether psychologists have a term for what happens to software developers and testers when things frequently go wrong. Paul Grizzaffi calls it “failure fatigue.” 

When this condition sets in, he says, people become desensitized to chronic and often intermittent issues because of the effort it takes to address them. That desensitization can allow additional defects to be released into new software. Failure fatigue can stem from several sources, including product defects, automated test script defects, deficiencies in automated testing tools, and process deficiencies.

Grizzaffi, an automation program architect and manager for revenue cycle technology at MedAssets, discussed failure fatigue and other testing-related issues last weekend at a DevOps Live conference in the Dallas area. 

Here is the first of a two-part conversation I had with Grizzaffi recently about his DevOps Live presentation and related topics. His only caveat: all opinions he expresses are his own, not those of any employer he has ever had. You can follow Paul on Twitter at @pgrizzaffi.

Describe what your presentation at DevOps Live was all about. 

At one time, most software teams wrote code in separate copies of the source code checked out from the code’s central storage location. These copies are called branches. One or more developers could use those separate copies to develop different features, minimizing up-front coding conflicts.

Unfortunately, the longer developers worked on their own copies of the source code, the more those copies diverged, and the harder it became to merge them back into the main copy of the software under development. That meant more problems were likely to surface – such as developer A accidentally overwriting developer B’s changes during a merge – and that those problems wouldn’t be discovered until late in the development cycle, when they were more difficult to address.

Beyond that, some kinds of defects are harder to detect when test automation runs that infrequently.

One fix for these problems is what’s known as “continuous integration,” meaning the frequent merging of new or changed code into the main copy of the software under development. Continuous integration includes an automated build and test on each code submission, which makes the process of merging code together easier.
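To make that concrete, here is a minimal sketch in Python of the build-and-test step a continuous integration system might run on each submission. The “make build” and “pytest” commands are stand-in assumptions for whatever build tool and test runner a team actually uses.

    #!/usr/bin/env python3
    # Minimal sketch of a CI build-and-test step, run on each code
    # submission. "make build" and "pytest" are assumed stand-ins for
    # a team's actual build tool and test runner.
    import subprocess
    import sys

    def build_and_test():
        for step in (["make", "build"], ["pytest", "tests/"]):
            if subprocess.run(step).returncode != 0:
                print("failed at step:", " ".join(step))
                return 1  # a non-zero exit blocks the merge
        print("build and tests passed; safe to merge")
        return 0

    if __name__ == "__main__":
        sys.exit(build_and_test())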

Any issues that can emerge with this approach? 

Running the test automation only when source code is added back to the storage location doesn’t actually cause defects; on the contrary, the approach is highly valuable.

The class of defects this approach is not as good at catching is intermittent errors, which are typically caused by “race conditions.” A race condition occurs when software has events that can happen in any order and still be valid, so the behavior depends on which order they actually happen in. In complex systems these events can be numerous, making it difficult to account for every ordering.
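A contrived Python sketch illustrates the idea: two threads increment a shared counter with no locking, and whether the final count is correct depends entirely on thread timing.

    import threading

    counter = 0  # shared state, deliberately unprotected

    def worker(iterations):
        global counter
        for _ in range(iterations):
            current = counter       # read
            counter = current + 1   # write; another thread may have
                                    # updated counter in between, so its
                                    # increment is silently lost

    threads = [threading.Thread(target=worker, args=(100_000,))
               for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Expect 200000; intermittently prints less when the race is hit.
    print("expected 200000, got", counter)

A test asserting the final count would pass on some runs and fail on others – exactly the kind of defect a once-per-build automation run can miss.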

What are some ways to help find those defects? 

One way to attempt to catch these intermittent errors (often caused by race conditions) is to supplement your standard “run on every build” automation with “periodic automation.” Periodic automation means running all, or a subset, of your automated tests on a time interval.

For example, if it’s been 30 minutes since your last automated build and test, run the automation again. The value here is that the more often you run the test scripts – i.e., look for a problem – the more likely you are to find existing race conditions – i.e., find an existing problem.
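A minimal sketch of that periodic trigger might look like the following; the 30-minute interval and the pytest command are placeholders, and in practice this job usually lives in a CI scheduler or cron rather than a bare loop.

    import subprocess
    import time

    INTERVAL_SECONDS = 30 * 60  # re-run the suite every 30 minutes

    while True:
        # Run the automated tests even though no new code was submitted;
        # repeated runs give intermittent race conditions more chances
        # to show themselves.
        result = subprocess.run(["pytest", "tests/"])
        if result.returncode != 0:
            print("periodic run failed; capture logs for analysis")
        time.sleep(INTERVAL_SECONDS)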

How can companies help reduce the failure fatigue that comes along with the process of finding these defects?

The primary way to reduce failure fatigue is to make sure that your automation tool and your automated test scripts are testing the correct things and are consistent in their execution. A big cause of failure fatigue is inconsistent test script results – i.e., a test script that passes most of the time but fails once in a while. The fatigue comes from the team looking at the results and saying, “Yep, it was that same intermittent failure again. Just ignore it.” The issue is that the team becomes desensitized to the error. If an actual defect were the cause of the failure, the team might miss it.
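One hedged sketch of how a team might act on that advice, before anyone learns to ignore a flaky script, is to re-run a failing test a few times and classify the result; the pytest invocation and the three-retry count are illustrative assumptions, not a prescription.

    import subprocess

    def run_test(test_id):
        # "pytest <test_id>" is a stand-in for however a single
        # automated test script actually gets executed.
        return subprocess.run(["pytest", test_id]).returncode == 0

    def classify(test_id, retries=3):
        # A script that fails every time is a consistent failure worth
        # immediate attention; one that flip-flops is intermittent and
        # should be quarantined and fixed, not silently ignored.
        outcomes = [run_test(test_id) for _ in range(retries)]
        if all(outcomes):
            return "pass"
        if not any(outcomes):
            return "consistent failure"
        return "intermittent: quarantine and investigate"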

Another cause of fatigue is running your tests too frequently. This fatigue comes from the fact that all results need to be evaluated. If analyzing the results takes a significant amount of time, some of the results may not be analyzed at all, which could cause the team to miss an actual intermittent defect. There are a couple of possible solutions here, such as running the automation less frequently or making the results analysis less effort-intensive.
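For that second option, one possible sketch is to diff each run’s failures against failures the team has already triaged, so only new failures need human review; the test names here are purely illustrative.

    def new_failures(current_failures, known_failures):
        # Surface only failures the team hasn't already triaged, so each
        # run costs minutes of review instead of hours.
        return sorted(set(current_failures) - set(known_failures))

    # Example: two failures seen this run, one already on the known list.
    print(new_failures(
        ["test_login_timeout", "test_report_totals"],
        ["test_login_timeout"],
    ))  # -> ['test_report_totals']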