When it comes to software, inaccurate tests are worse than no tests

Performance testing is typically done to ensure that enterprise websites can scale up to support peak loads. This is all well and good, so long as the tests provide an accurate picture of what actually happens in live systems.

But at the JavaOne conference recently, Gil Tene, CTO and co-founder of Azul Systems, argued that organizations are setting themselves up for problems if they don’t accurately specify and test the worst cases — an ability Service Virtualization was specifically invented to achieve.

Still, problems can and do occur, both in real-world labs and with Service Virtualization, if the performance testing is not done in a way that accurately characterizes worst-case performance. Tene says, “Very few people I know have a lab that mimics reality.”

Focus on worst cases rather than statistics 

He should know: over the years he has spent hundreds of hours poring over application latency logs to identify challenging problems in software development. Azul Systems makes a Java Virtual Machine with an efficient garbage collector that addresses many kinds of performance problems, but worst-case performance problems can be caused by a variety of other factors, he says.

The main problem, as Tene sees it, lies in the way organizations specify the performance of software applications: service-level agreements are written in terms of percentages rather than worst cases. On paper, most organizations seem happy with 99.99% (so-called “four nines”) or even 99.9999% (“six nines”). Tene noted that applications often show tremendous jumps in performance degradation at exactly these tiers.
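One way to see why a high percentile can still leave users exposed: a single user operation usually touches many back-end requests, so the odds of hitting at least one response beyond the stated percentile add up quickly. Here is a minimal sketch of that arithmetic; the request counts are made-up illustration values, not figures from Tene's talk:

```java
public class PercentileExposure {
    public static void main(String[] args) {
        double[] percentiles = {0.999, 0.9999};        // "three nines" and "four nines"
        int[] requestsPerOperation = {10, 100, 1000};  // hypothetical back-end calls per user action

        for (double p : percentiles) {
            for (int n : requestsPerOperation) {
                // Probability that at least one of n independent requests falls
                // beyond the p-th percentile of the latency distribution.
                double chance = 1.0 - Math.pow(p, n);
                System.out.printf("percentile=%.4f, requests=%4d -> %.1f%% of operations hit a worse-than-percentile response%n",
                        p, n, chance * 100.0);
            }
        }
    }
}
```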

In practice, Tene argues, organizations need to be more worried about worst-case performance and engineer their systems to meet those needs. For example, one high-frequency trading company Tene worked with saw delays in its trading platform jump 300-fold owing to subtle problems in the software. Engineers who focus on bringing the average performance of their systems down typically increase the delays for the worst cases in the process.

It is also important to note that server response time is not necessarily correlated with load. Quite often hiccups occur as one or more software processes, such as garbage collection, cron jobs, and data re-indexing, run. Much of the time these hiccups come from accumulated work being done rather than from load, Tene says. “When they happen, you need to understand how often and how big they are to understand if they are acceptable.”
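This is the idea behind jHiccup, one of the tools discussed below: a thread that does nothing but sleep for a fixed interval and check how long the sleep actually took will reveal platform-level stalls even when no load is applied. A minimal sketch of that measurement (the interval and reporting threshold are arbitrary illustration values, not jHiccup's actual implementation):

```java
public class HiccupMeter {
    // How long the idle thread intends to sleep on each iteration.
    private static final long SLEEP_NANOS = 1_000_000L; // 1 ms

    public static void main(String[] args) throws InterruptedException {
        while (true) {
            long start = System.nanoTime();
            Thread.sleep(SLEEP_NANOS / 1_000_000L);
            long elapsed = System.nanoTime() - start;

            // Anything beyond the intended sleep time is a "hiccup": the JVM,
            // OS, or hardware stalled this otherwise idle thread.
            long hiccupNanos = Math.max(0, elapsed - SLEEP_NANOS);
            if (hiccupNanos > 10_000_000L) { // report stalls longer than 10 ms
                System.out.printf("hiccup of %.1f ms detected%n", hiccupNanos / 1_000_000.0);
            }
        }
    }
}
```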

Lies, damned lies and statistics

Tene says there are a variety of systematic problems that lead organizations to deceive themselves about how their software behaves in the worst cases.

Everything is useless if the data logged into performance monitoring charts has gaps. Most load generators Tene has looked at, such as Apache JMeter, wait until the previous response has come back before issuing the next request. If a long delay occurs, these testing tools are unable to send the next request on schedule. The net result is that the magnitude of delays is improperly characterized.

This problem is compounded by the fact that most of these tools use TCP to generate the load, and TCP has a characteristic delay in sending out subsequent packets when the network stalls waiting for a response. The net result is that even load-generating tools like YCSB, which are designed to send a constant load, end up stalling between requests owing to networking delays.
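A common way to avoid understating these delays is to measure each response against the time the request should have been sent on a fixed schedule, rather than the time it was actually sent after the previous response returned. A rough sketch of that approach, with a hypothetical sendRequest() standing in for whatever client call the test exercises:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.LockSupport;

public class FixedScheduleLoadLoop {
    public static void main(String[] args) {
        long intervalNanos = 10_000_000L;          // intended rate: one request every 10 ms
        int totalRequests = 1_000;
        List<Long> latenciesNanos = new ArrayList<>();

        long nextIntendedStart = System.nanoTime();
        for (int i = 0; i < totalRequests; i++) {
            // Wait until the scheduled send time, unless we are already running late.
            long now = System.nanoTime();
            if (now < nextIntendedStart) {
                LockSupport.parkNanos(nextIntendedStart - now);
            }

            sendRequest(); // hypothetical client call under test

            // Measure from the *intended* start time, so a stalled server shows up
            // as long latencies instead of silently missing scheduled requests.
            latenciesNanos.add(System.nanoTime() - nextIntendedStart);
            nextIntendedStart += intervalNanos;
        }

        long worst = latenciesNanos.stream().mapToLong(Long::longValue).max().orElse(0);
        System.out.printf("worst-case latency: %.1f ms%n", worst / 1_000_000.0);
    }

    private static void sendRequest() {
        // Placeholder for the real request; a short pause stands in for server work.
        LockSupport.parkNanos(1_000_000L);
    }
}
```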

A good way to determine whether these problems affect your performance tests is to deliberately stall the server with CTRL-Z during a test run, wait, and then resume it. If this substantial delay is not reflected in the performance results, the tool has a problem.

Putting the worst under a microscope

Tene recommends that organizations worried about the worst-case performance of their systems use a variety of open source tools Azul developed:

  • HdrHistogram is a library for recording and plotting performance results on a logarithmic scale, which makes it easier to visualize application performance in the worst cases (a short usage sketch follows this list).
  • jHiccup is a tool for identifying brief disruptions in the performance of Java Virtual Machines caused by garbage collection, database indexing, and other problems.
  • LatencyUtils is a set of tools built on top of jHiccup for detecting pauses and automatically readjusting the statistics collected by performance measurement tools.
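As an example of the first tool, a test harness can record each measured latency into an HdrHistogram and then print the full percentile distribution, including the maximum, instead of a single average. A minimal sketch (the recorded values are fabricated, and the histogram bounds and precision are just reasonable defaults):

```java
import org.HdrHistogram.Histogram;

public class LatencyReport {
    public static void main(String[] args) {
        // Track values from 1 ns up to 1 hour, with 3 significant decimal digits.
        Histogram histogram = new Histogram(3_600_000_000_000L, 3);

        // In a real test these would be measured response times in nanoseconds;
        // here a few fabricated values stand in, including one large outlier.
        long[] sampleLatencies = {1_200_000, 1_500_000, 900_000, 2_000_000, 850_000_000};
        for (long latency : sampleLatencies) {
            // recordValueWithExpectedInterval back-fills the samples that a stall
            // prevented from being sent on their 10 ms schedule, countering the
            // load-generator gap described above.
            histogram.recordValueWithExpectedInterval(latency, 10_000_000L);
        }

        // Print the percentile distribution scaled to milliseconds; the worst case
        // appears explicitly rather than being hidden behind an average.
        histogram.outputPercentileDistribution(System.out, 1_000_000.0);
        System.out.printf("99.99th percentile: %.1f ms, max: %.1f ms%n",
                histogram.getValueAtPercentile(99.99) / 1_000_000.0,
                histogram.getMaxValue() / 1_000_000.0);
    }
}
```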

Creative Commons image by Sebastian Bergmann.