Lies, Damned Lies, And Statistics . . . And Benchmarks

I have been working on a research document, to be published this quarter, on the impact of 8-socket x86 servers based on Intel’s new Xeon 7500 CPU. In a nutshell, these systems have the performance of the best-of-breed RISC/UNIX systems of three years ago, at a substantially better price, and their overall performance improvement trajectory has been steeper than competing technologies for the past decade.

This is probably not shocking news and is not the subject of this post, although I would encourage you to read the document when it is finally published. While researching it, I spent time trying to use available benchmark results to prove or disprove my thesis that x86 system performance solidly overlapped that of RISC/UNIX. The process highlighted for me the limitations of using standardized benchmarks for performance comparisons. There are now so many benchmarks available that system vendors run each one on only selected subsets of their product lines, if at all. Additionally, most benchmarks suffer from several common flaws:

  • They are results from high-end configurations, in many cases far beyond anything found in normal use, and the results cannot be interpolated down to smaller, more realistic configurations.
  • They are often the result of teams of very smart experts tuning the system configuration and the application and system software parameters for optimal results. For a large benchmark such as SAP or TPC, it is probably reasonable to assume that there are over 1,000 variables involved in the tuning effort. This makes the results very much like EPA mileage figures: the consumer is guaranteed not to exceed these numbers.
  • The cost metrics, where supplied (some benchmarks, such as the SAP benchmarks, prohibit any cost comparisons), are of little use to a real-world user in determining their actual operating results, both because of the unreasonable configurations and because of the way the costs are sometimes calculated, with large discounts buried in the report details. For example, the TPC benchmarks are usually configured with excessive storage in unnatural configurations, which obscures the processing price-performance of the server unless the user does additional detailed analysis.

In the end, I used the SAP benchmark, which, despite not allowing explicit cost comparisons per workload, at least had published configurations for all the systems under consideration and had sufficient information to allow me to independently derive prices for the server components of the benchmarks.
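For readers who want to try a similar exercise, here is a minimal sketch of the arithmetic: take a published benchmark score, pair it with a server price you have derived yourself, and compare the price per unit of score. The names (BenchmarkResult, price_per_unit, rank_by_price_performance) and every number below are hypothetical placeholders, not figures from any published result or from my research document.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    system: str          # label for the system under test (hypothetical)
    score: float         # published benchmark throughput score
    server_price: float  # independently derived price of the server components only


def price_per_unit(result: BenchmarkResult) -> float:
    """Return the server price divided by the benchmark score (lower is better)."""
    if result.score <= 0:
        raise ValueError("benchmark score must be positive")
    return result.server_price / result.score


def rank_by_price_performance(results):
    """Return the results sorted from best (cheapest per unit of score) to worst."""
    return sorted(results, key=price_per_unit)


# Placeholder data for illustration only; these are NOT real benchmark results.
example_results = [
    BenchmarkResult("System B", score=50_000, server_price=200_000),
    BenchmarkResult("System A", score=40_000, server_price=100_000),
]

for r in rank_by_price_performance(example_results):
    print(f"{r.system}: {price_per_unit(r):.2f} per unit of score")
```

The point of pricing only the server components is to keep storage and discounting choices in a given report from swamping the comparison. Whether that is the right boundary for your environment is a judgment call.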

With my consulting hat on, I would proffer the following advice to I&O professionals with regard to benchmarking for new server purchases:

  • Use your own workloads whenever possible for benchmarks to avoid all the limitations of standardized benchmarks.
  • Do not be shy about asking vendors to provide you with people and technical resources to assist you in doing so. If they want a big pile of your money, they should be willing to invest in the process. Do not expect vendors to run the benchmarks with no participation from you; they expect some investment on your part as well. And you really should participate, so that you understand exactly how the benchmarks were run.
  • Don’t allow the vendor to use unrealistic configurations or tuning beyond the level that you are comfortable with or have the knowledge to implement within your production environment.
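As a rough illustration of the first point, a home-grown benchmark can start as something very simple: run your own workload several times on each candidate system, with the configuration you would actually deploy, and compare the spread of timings. The sketch below assumes the workload can be wrapped in a Python callable; time_workload and its parameters are hypothetical names chosen for this example, not part of any benchmark suite.

```python
import statistics
import time


def time_workload(workload, runs=5, warmup=1):
    """Time a callable several times and summarize the elapsed seconds.

    workload: a zero-argument callable that runs your own job (an assumption).
    runs:     number of timed repetitions.
    warmup:   untimed repetitions run first, so caches and pools are populated.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    for _ in range(warmup):
        workload()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


def example_workload():
    # Stand-in for a real job; replace with a call into your own application.
    sum(i * i for i in range(100_000))


summary = time_workload(example_workload, runs=5, warmup=1)
print(summary)
```

Reporting the median along with the minimum and maximum, rather than a single best run, makes it harder for one lucky result to tell the story.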

I’d love to hear from readers on this subject. How important to you are standardized benchmarks, and which ones do you pay attention to when you are looking to make a major server purchase?

Late addition: Since I first posted this blog, I've received a number of comments from vendors on the draft of the document that I circulated for a review of factual accuracy, and all responded with benchmarks that showed their systems to best advantage (no surprise there). Unfortunately, when the dust had settled, there was still no way to make a good comparison across all the products. There seem to be enough benchmarks available that, much like Lake Wobegon, where "all the children are above average," everybody can be the best on some benchmark. Caveat emptor still rules.

*** Historical note — The quote about “Lies, damn lies and statistics” is often attributed to Mark Twain, but he actually attributed it to the famous British Prime Minister Benjamin Disraeli. Wikipedia (http://en.wikipedia.org/wiki/Lies,_damned_lies,_and_statistics) goes on to state that its actual provenance is somewhat cloudy since it never appears in any of Disraeli’s writings. But the sentiment embodied in the citation is clear even if its origins are not.