Thoughts about Computer Benchmark Testing
Computers don’t have a single horsepower rating, so how do you best determine their real-world performance?
Unlike vehicles whose specifications include a horsepower rating, computers don’t have a single number that indicates how powerful they are. True, horsepower (what an antiquated term that is) isn’t really all that relevant, because that number alone doesn’t tell the whole story. The same engine that makes a light vehicle a rocket might struggle in a larger, heavier vehicle. And even if the vehicles weighed the same, performance might differ because of a number of other factors, such as transmission, aerodynamics, tires, and so on. That’s why car magazines measure and report on various aspects and capabilities of a vehicle: acceleration, braking, and roadholding, as well as subjective impressions such as how effortlessly and pleasantly a vehicle drives and handles.
So how is performance measured in computers, and why is “performance” relevant in the first place? To answer that, we first need to look at why performance is needed and what it does in both vehicles and computers. In vehicles, performance clearly means how effortlessly the engine moves the car or truck. Step on the gas, and the vehicle instantly responds. And it has enough performance to handle every situation. That said, absolute peak performance may or may not matter. On the racetrack it does, but what constitutes enough performance for tooling through town and everyday driving is a different matter altogether. And that’s pretty much the same for computers.
The kind of performance that matters
The kind of performance that matters most in computers is that which enables the system to respond quickly and effortlessly to the user’s commands. And just like in vehicles, very high peak performance may or may not matter.
If a system is used for clearly defined tasks, all that is needed is enough performance to handle those tasks; everything above that is wasted. If a system may be used for a variety of tasks, there must be enough performance to reasonably handle everything it may encounter. And if a system must be able to handle very complex tasks and very heavy loads, it must have enough peak performance to do that work as well as possible.
What affects performance?
So how do we know if a system can handle a certain load? In combustion engines, the number of cylinders matters, even though, thanks to turbocharging and computerized engine control, that is no longer as relevant as it once was. Still, apart from motorcycles, you see virtually no vehicles with just one or two cylinders. Four was/is the norm for average vehicles, six is better, and eight or even twelve means power and high performance. And it’s much the same in computers: the number of computing cores, the cylinders of a CPU, often suggests its power. Quad-core is better than dual-core, octa-core is a frequently used suggestion of high performance, and very high performance systems may have even more.
But the number of cores is not all that matters. After all, what counts in computing is how many instructions can be processed in a given time. And that’s where clock speed comes in. That’s measured in megahertz or gigahertz, millions or billions of cycles per second. More is better, but the number of cycles doesn’t tell the whole story. That’s because not all instructions are the same. In the world of computers, there are simple instructions that perform just basic tasks, and there are complex instructions that accomplish much more in a single cycle. Which is better? The automotive equivalent may not be the number of cylinders, but how big each cylinder is and how much punch it generates with each stroke. For many years Americans valued big 8-cylinder motors, whereas European and Japanese vehicle manufacturers favored small, efficient 4-cylinder designs.
RISC vs CISC
In computers, the term RISC means reduced instruction set computer and CISC complex instruction set computer. The battle between the two philosophies began decades ago, and it carries on today. Intel makes the CISC-based chips that drive most PCs in the world. On the other side is ARM (Advanced RISC Machine), whose designs are used in virtually all smartphones and small tablets.
What that means is that computing performance depends on the number of computing cores, the type and complexity of the cores, the number of instructions that can be completed in a second, and the type and complexity of those instructions.
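As a rough illustration of how these factors combine, here is a minimal Python sketch that estimates theoretical peak throughput as cores times clock speed times instructions per cycle. All the numbers are invented for the example, and real chips never reach this simple product, but it shows why core count alone tells you little.

```python
# Rough, simplified estimate of peak instruction throughput.
# All numbers below are hypothetical examples, not measurements.

def peak_throughput(cores: int, clock_ghz: float, instructions_per_cycle: float) -> float:
    """Return a theoretical peak in billions of instructions per second."""
    return cores * clock_ghz * instructions_per_cycle

# A hypothetical quad-core chip with complex, "big" instructions...
cisc_like = peak_throughput(cores=4, clock_ghz=3.0, instructions_per_cycle=2.0)

# ...versus a hypothetical octa-core chip with simpler instructions.
risc_like = peak_throughput(cores=8, clock_ghz=2.0, instructions_per_cycle=1.5)

print(f"CISC-like peak: {cisc_like:.1f} billion instructions/second")
print(f"RISC-like peak: {risc_like:.1f} billion instructions/second")
```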
But that is far from all. Performance also depends on numerous other variables. It matters how the operating system uses the results of executed instructions and converts them into something that is valuable for the user. That can be as simple as making characters appear on the display when the user types on the keyboard, or as complex as computing advanced 3D operations or shading.
Performance depends not just on the processor, but also on the various supporting systems that the processor needs to do its work. Data must be stored in and retrieved from memory and/or mass storage. How, and how well, that is done has a big impact on performance. The overall “architecture” of a system greatly matters. How efficient is it? Are there bottlenecks that slow things down?
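To make the storage point concrete, here is a minimal Python sketch of a storage micro-benchmark that times how long a system takes to write and read back a test file. The file name and size are arbitrary choices for the example, and a real benchmark would be far more careful (block sizes, queue depths, caching), but the principle is the same.

```python
# Minimal sketch of a storage micro-benchmark: time a write and a read.
# The file size and path are arbitrary example values, not a standard test.
import os
import time

TEST_FILE = "benchmark_test.bin"   # hypothetical temporary file
SIZE_MB = 256
data = os.urandom(SIZE_MB * 1024 * 1024)

start = time.perf_counter()
with open(TEST_FILE, "wb") as f:
    f.write(data)
    f.flush()
    os.fsync(f.fileno())           # make sure the data actually reaches the disk
write_seconds = time.perf_counter() - start

start = time.perf_counter()
with open(TEST_FILE, "rb") as f:
    f.read()
read_seconds = time.perf_counter() - start

os.remove(TEST_FILE)
print(f"Write: {SIZE_MB / write_seconds:.1f} MB/s, read: {SIZE_MB / read_seconds:.1f} MB/s")
```

Note that the read may be served from the operating system’s cache rather than the disk itself, which is exactly the kind of subtlety that makes professional storage benchmarks more involved than this sketch.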
The costs of performance
And then there is the amount of energy the computer consumes to get its work done. That’s the equivalent of how much gas a combustion engine burns to do its job. In vehicles, more performance generally means more gas. There are tricks to get the most peak power out of each gallon or to stretch each gallon as much as possible.
In computers it’s electricity. By and large, the more electricity, the more performance. And just like in vehicles and their combustion engines, efficiency matters. Technology determines how much useful work we can squeeze out of each gallon of gas and out of each kilowatt-hour of electricity. Heat is a byproduct of converting both gasoline and electricity into useful (for us) performance. Minimizing and managing that waste heat is key to both maximum power generation and the efficiency of the process.
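A common way to express that efficiency is performance per watt: a benchmark score divided by measured power draw. The tiny Python sketch below uses invented numbers for two imaginary systems purely to illustrate the idea that a slower machine can still be the far more efficient one.

```python
# Hypothetical example: efficiency expressed as benchmark score per watt.
def performance_per_watt(score: float, watts: float) -> float:
    return score / watts

# Invented numbers for two imaginary systems.
desktop_efficiency = performance_per_watt(score=6000, watts=150)   # 40 points/W
tablet_efficiency = performance_per_watt(score=2500, watts=15)     # ~167 points/W

print(f"Desktop: {desktop_efficiency:.0f} points per watt")
print(f"Tablet:  {tablet_efficiency:.0f} points per watt")
```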
So how does all of that relate to “benchmarking”?
Benchmarking represents an effort to provide an idea of how well a computer performs compared to other computers. But how does one do that? With vehicles, it’s relatively simple. There are established and generally agreed-upon measures of performance. While “horsepower” itself, or the number of cylinders, means relatively little, there is the time a vehicle needs to accelerate from 0 to 60 miles per hour, or to cover a 1/4 mile from a standing start. For efficiency, the number of miles one can drive per gallon matters, and that is measured for various use scenarios.
No such simple and well-defined measurement standards exist for computers. When, together with a partner, I started Pen Computing Magazine over a quarter of a century ago, we created our own “benchmark” test as the first little Windows CE-based clamshell computers came along. Our benchmark consisted of a number of things a user of such a device might plausibly do in a given workday. The less overall time a device needed to perform all of those tasks, the more powerful it was, and the easier and more pleasant it was to use.
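A workday-style benchmark of that sort can be approximated with a simple timing harness. The sketch below is a hypothetical Python reconstruction, not the original Pen Computing test; the task functions are placeholders (here simulated with short sleeps) for whatever a real user would actually do.

```python
# Minimal sketch of a "workday" benchmark: run a list of representative tasks
# and report the total elapsed time. Lower is better.
import time

def open_documents():            # placeholder for a real task
    time.sleep(0.1)

def recalculate_spreadsheet():   # placeholder for a real task
    time.sleep(0.2)

def search_contacts():           # placeholder for a real task
    time.sleep(0.05)

TASKS = [open_documents, recalculate_spreadsheet, search_contacts]

total = 0.0
for task in TASKS:
    start = time.perf_counter()
    task()
    elapsed = time.perf_counter() - start
    total += elapsed
    print(f"{task.__name__}: {elapsed:.2f} s")

print(f"Total workday score: {total:.2f} s (lower is better)")
```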
And that is the purpose of a benchmark: to see how long it takes to complete the tasks for which we use a computer. But, lacking such generally accepted concepts as horsepower, 0-60 and 1/4-mile acceleration, and gas mileage, what should a benchmark measure?
What should a benchmark measure?
The answer is, as so often, “it depends.” Should the benchmark be a measure of overall, big picture performance? Or should it measure performance in particular areas or with a particular type of work or a particular piece of software?
Once that has been decided, what else matters? We’d rate being able to compare results with as many other tested systems as possible as very important, because no performance benchmark result is an island. It only means something in comparison with other results.
And that’s where one often runs into problems. That’s because benchmark software evolves along with the hardware it is designed to test. That’s a good thing, because older benchmark software may not test, or know how to test, new features and technologies. But it can also be a problem, because results obtained with different versions of a particular benchmark may no longer be comparable.
But version differences are not the only pitfall; weighting is another. Most performance benchmarks test various subsystems and then assign an importance, a “weight,” to each when calculating the overall performance value.
Here, again, weighting may change over time. One glaring example is the change in weighting when mass storage went from rotating disks to solid state drives and then to much faster PCIe NVMe solid state drives. Storage benchmark results increased so dramatically that overall benchmark results would have been distorted unless the weighting of the disk subsystem was re-evaluated and changed.
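The effect is easy to demonstrate with a small Python sketch. The subsystem scores and weights below are invented purely for illustration; they show how an unchanged disk weight lets one exploding subsystem score dominate the overall number.

```python
# Hypothetical example of a weighted overall benchmark score.
# Scores and weights are invented for illustration only.

def overall_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of subsystem scores; weights should sum to 1.0."""
    return sum(scores[name] * weights[name] for name in scores)

weights = {"cpu": 0.4, "memory": 0.2, "graphics": 0.2, "disk": 0.2}

hdd_era_system = {"cpu": 1000, "memory": 900, "graphics": 800, "disk": 100}
nvme_era_system = {"cpu": 1100, "memory": 950, "graphics": 850, "disk": 8000}

print(f"HDD-era overall:  {overall_score(hdd_era_system, weights):.0f}")
print(f"NVMe-era overall: {overall_score(nvme_era_system, weights):.0f}")
# With the old 20% disk weight, the NVMe system's overall score is dominated
# by storage alone, which is why benchmark makers re-evaluate the weighting.
```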
Overall system performance
The big question then becomes: what kind of benchmark presents the most reliable indicator of overall system performance? That would be one that shows not only how a system scores against current-era competition, but also against older systems with older technologies. One that tests thoroughly enough to truly present a good measure of overall performance. One that makes it easy to spot weaknesses or strengths in the various components of a system.
But what if one size doesn’t fit all? What if one wants to know how well a system performs in a particular area, such as advanced graphics? And within that sub-section, how well the design performs with particular graphics standards, and then even how it works with different revisions of those standards? That’s where it can quickly get complex and very involved.
Consider that raw performance isn’t everything. A motor with so and so much horsepower may run 0-60 and the 1/4-mile in so and so much time. But put that same motor in a much heavier car, and the vehicle would run 0-60 and the 1/4-mile much slower. Computers aren’t weighed down by weight, but by operating system overhead. Recall that the earliest PCs often felt very quick and responsive, whereas today’s systems, with technology that’s hundreds or thousands of times as powerful, can be sluggish. Which means that OS software matters, too, and its impact is rarely measured in benchmark results.
Finally, in the best of all worlds, there’d be benchmarks that could measure performance across operating systems (like Windows and Android) and processor technologies (like Intel x86 and ARM). Such benchmarks do not yet truly and reliably exist.
How we measure performance
Which brings me to the way we benchmark at our operation, RuggedPCReview.com. As the name indicates, we examine, analyze, test, and report on rugged computers. Reliability, durability, and being able to hold up under extreme conditions matter most in this field. Performance matters also, but not quite as much as concept, design, materials, and build. Rugged computers have a much longer life cycle than consumer electronics, and it is often power consumption, heat generation, long-term availability, and special “embedded” status that rate highest when evaluating such products. But it is, of course, still good to know how well a product performs.
So we chose two complete-system benchmarks that each give all parts of a computer a good workout, and we standardized on versions that would not become obsolete quickly. This approach has served us well for nearly a decade and a half. As a result, we have benchmarks of hundreds of systems that are all still directly comparable.
A few years ago, we did add a third benchmark, a newer version of PassMark, mostly because our original standard version no longer ran reliably on some late-model products.
Do keep all of that in mind when you peruse benchmarks. Concentrate on what matters to you and your operation. If possible, use the same benchmark for all of your testing or evaluation. It makes actual, true comparison so much easier.