The problem with benchmarks
When we recently used our standard benchmark suite to test the performance of a new rugged computer, we thought it’d be just another entry into the RuggedPCReview.com benchmark performance database that we’ve been compiling over the past several years. We always run benchmarks on all Windows-based machines that come to our lab, and here’s why:
1. Benchmarks are a good way to see where a machine fits into the overall performance spectrum. The benchmark bottom line is usually a pretty good indicator of overall performance.
2. Benchmarks show the performance of individual subsystems; that’s a good indicator of the strengths and compromises in a design.
3. Benchmarks show how well a company took advantage of a particular processor, and how well it optimized the performance of all the subsystems.
That said, benchmarks are not the be-all, end-all of performance testing. Over the years we’ve been running benchmarks, we have often found puzzling inconsistencies that were hard to explain. We began using multiple benchmark suites as a sort of “checks and balances” system. That often helped in pinpointing test areas where a particular benchmark simply didn’t work well.
There is a saying that there are three kinds of lies: “lies, damn lies, and statistics.” It supposedly goes back to a 19th-century politician. At times one might be tempted to coin a similar phrase about benchmarks, but that would be unfair to the significant challenge of creating and properly using benchmarks.
It is, in fact, almost impossible to create benchmarks that fairly and accurately measure performance across processor architectures, operating systems, different memory and storage technologies, and even different software algorithms. For that reason, when we list benchmark results in our full product reviews, I always add an explanation outlining the various benchmark caveats.
Does that mean benchmarks are useless? It doesn’t. Benchmarks are a good tool to determine relative performance. Even if subsystem benchmarks look a bit suspect, the bottom-line benchmark number of most comprehensive suites generally provides a good indicator of overall performance. And that’s why we run benchmarks whenever we can, and why we publish them as well.
Now, in the instance that prompted me to write this blog entry, we ran benchmarks and then, as a courtesy, ran them by the manufacturer. Most of the time a manufacturer’s numbers and ours are very close, but this time they were not. Theirs were much higher, both for CPU and for storage. We ran ours again, and the results were pretty much the same as the first time we ran them.
The manufacturer then sent us their numbers, and they were indeed different, and I quickly saw why. Our test machine used its two solid state disks as two separate disks, whereas I was pretty sure the manufacturer had theirs configured as RAID 0, i.e., striping, which resulted in roughly twice the disk subsystem performance (the CPU figures were the same). A second set of numbers came from a machine that had 64-bit Windows 7 installed, whereas our test machine ran 32-bit Windows 7, which for compatibility reasons is still used by most machines that come through the lab.
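The RAID part of the difference, at least, makes sense on paper: with RAID 0, large sequential reads and writes are split across both SSDs, so sequential throughput can approach twice that of a single drive. For anyone who wants to sanity-check that on their own hardware, here is a minimal C sketch that simply streams through a large test file and reports MB/s. It is just an illustration, not how Passmark 6.1 measures disk performance; the file name is a placeholder, and the test file needs to be much larger than installed RAM so the Windows file cache doesn’t inflate the numbers.

```c
/* Minimal sequential-read throughput check (illustration only; not how
 * Passmark or CrystalMark measure disk performance internally). Streams
 * through a large test file in 4 MB blocks and reports MB/s. The default
 * file name is a placeholder. Caveat: use a test file much larger than
 * installed RAM, or the OS file cache will inflate the result. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BLOCK_SIZE (4 * 1024 * 1024)   /* 4 MB per read, large enough to stream */

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "testfile.bin";  /* placeholder */
    FILE *f;
    char *buf;
    size_t n;
    double total = 0.0;
    time_t start;
    double secs;

    f = fopen(path, "rb");
    if (!f) { perror("fopen"); return 1; }

    buf = malloc(BLOCK_SIZE);
    if (!buf) { fclose(f); return 1; }

    start = time(NULL);                 /* 1-second resolution; fine for multi-GB files */
    while ((n = fread(buf, 1, BLOCK_SIZE, f)) > 0)
        total += (double)n;
    secs = difftime(time(NULL), start);
    if (secs < 1.0) secs = 1.0;         /* avoid divide-by-zero on very small files */

    printf("Read %.0f MB in %.0f s = %.1f MB/s\n",
           total / 1048576.0, secs, total / 1048576.0 / secs);

    free(buf);
    fclose(f);
    return 0;
}
```

Run against a file on a single SSD and then against one on a striped RAID 0 volume, the two results should show roughly the kind of gap we saw in the disk subsystem scores.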
The manufacturer then emailed back and said they’d overnight the two machines they had used for testing, including the benchmark software they had used (the same as ours, Passmark 6.1). The machines arrived via FedEx, we ran the benchmarks, and they confirmed the manufacturer’s results, with much higher numbers than ours. And yes, the two SSDs were in a RAID 0 configuration. Just to double-check, we installed the benchmark software from our own disk and confirmed their result on the 32-bit machine. Then we ran our benchmark software on the 64-bit Windows machine, and… our numbers were pretty much the same as those of the machine running 32-bit Windows.
Well, turns out there is a version of Passmark 6.1 for 32-bit Windows and one for 64-bit Windows. The 64-bit version shows much higher CPU performance numbers, and thus higher overall performance.
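Why would a 64-bit build of the same suite post higher CPU numbers on the same hardware? One plausible mechanism (and I’m not claiming this is what Passmark 6.1 does internally) is 64-bit integer arithmetic: an x64 binary handles it with single native instructions and has more registers to work with, while a 32-bit binary has to piece each 64-bit operation together from several 32-bit ones. The little C sketch below, compiled once with gcc -O2 -m32 and once with gcc -O2 -m64, should make that visible; the file name and loop count are arbitrary.

```c
/* Tiny 64-bit integer workload for comparing a 32-bit and a 64-bit build
 * of the same source (illustration only; not a statement about what
 * Passmark 6.1 measures internally). On x64 the multiply below is a single
 * native instruction; a 32-bit build has to synthesize each 64-bit multiply
 * from several 32-bit operations.
 * Build examples:  gcc -O2 -m32 mul64.c    and    gcc -O2 -m64 mul64.c  */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <time.h>

int main(void)
{
    volatile uint64_t x = 0x123456789ABCDEFULL;  /* volatile: force a real load each pass */
    uint64_t acc = 1;
    uint64_t i;
    clock_t start;
    double secs;

    start = clock();
    for (i = 0; i < 200000000ULL; i++)
        acc = acc * x + i;                       /* 64-bit multiply-accumulate */
    secs = (double)(clock() - start) / CLOCKS_PER_SEC;

    /* Print acc so the compiler cannot discard the loop entirely */
    printf("checksum %" PRIu64 ", %.2f seconds\n", acc, secs);
    return 0;
}
```

The absolute times don’t matter; the point is that identical source code can score quite differently depending on whether it was built for 32-bit or 64-bit Windows, which is exactly the kind of gap we ran into.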
Next, we installed our second benchmark suite, CrystalMark. CrystalMark pretty much ignored the RAID configuration and showed disk results no higher than the ones we had found on our initial non-RAID machine. CrystalMark also showed pretty much the same CPU numbers for both the 32-bit and the 64-bit versions of Windows.
Go figure.
This put us in a bit of a spot because we had planned to show how the tested machine compared to its competition. We really couldn’t do that now, as it would have meant comparing apples and oranges, or, in this case, results obtained with two different versions of our benchmark software.
There was an additional twist: the tested machine had a newer processor than some of the comparison machines, yet those machines scored almost as high or higher in some CPU benchmarks. The manufacturer felt this went against common sense and backed up that position with several additional benchmarks supplied by the chip maker. I have seen older systems outperform newer ones in certain benchmarks before, so I think it’s quite possible that older technology can be as quick or quicker in some tests, though the overall bottom line almost always favors newer systems (as it did here).
The implication of all this is that our benchmark suites seem to measure performance properly across Windows XP, Vista and 7, but things apparently break down when it comes to 64-bit Windows. And the vast discrepancy between the two benchmark suites in how they handle RAID is also alarming.
It was good to be able to use the exact same benchmark software to objectively measure hundreds of machines, but I am now rethinking our benchmarking approach. I greatly value consistency and comparability of results, and the goal remains arriving at results that give a good idea of overall perceived performance, but we can’t have discrepancies like the ones I witnessed here.