IP Benchmarks and SoC Optimization
Benchmarks tend to be used a lot in marketing, and for good reason. They are carefully crafted to give an accurate representation of performance. This can sometimes cause frustration for our virtual prototype customers as they try mightily to replicate the benchmark numbers they saw in the marketing brochures. Achieving the performance numbers used in marketing benchmarks can be a lot of work. It's work worth pursuing, however, both to optimize the performance of individual IP blocks and to make the overall SoC itself more marketable.
One thing that we've definitely seen from customers, though, is that benchmark numbers can't be looked at in a vacuum. They can vary wildly depending upon a number of factors, ranging from the configuration of the target IP block to cache layout, system memory wait states and more. If your processor isn't giving you the same benchmark numbers as you see in the marketing brochure, it's not typically a processor configuration issue but a design choice made somewhere else in the system. Virtual prototypes are ideally suited to find these system bottlenecks and identify areas for optimization.
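To make the effect of system-level choices a bit more concrete, here is a minimal sketch of the textbook effective-CPI model, which charges every cache miss a fixed penalty. It is not taken from any Carbon tool, and every number in it (base CPI, memory references per instruction, miss rate, miss penalty) is a made-up assumption for illustration only.

```python
def effective_cpi(base_cpi, mem_refs_per_instr, miss_rate, miss_penalty_cycles):
    """Average cycles per instruction once cache misses are charged.

    All four inputs are hypothetical values chosen by the caller; a real
    system has overlapping misses, write buffers and so on that this
    simple model ignores.
    """
    return base_cpi + mem_refs_per_instr * miss_rate * miss_penalty_cycles


def relative_score(base_cpi, mem_refs_per_instr, miss_rate, miss_penalty_cycles):
    """Benchmark score relative to an ideal memory system (1.0 = no misses)."""
    return base_cpi / effective_cpi(
        base_cpi, mem_refs_per_instr, miss_rate, miss_penalty_cycles
    )


if __name__ == "__main__":
    # Same hypothetical processor, two hypothetical memory systems that
    # differ only in how many cycles a cache miss costs.
    for penalty in (20, 60):
        score = relative_score(1.0, 0.3, 0.05, penalty)
        print(f"miss penalty {penalty} cycles -> {score:.2f} of the ideal score")
```

Even in this crude model the processor itself never changes, yet the score moves substantially with the miss penalty alone, which is the kind of effect an accurate virtual prototype lets you measure rather than guess.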
We typically see benchmarks used by customers for two different use cases: to make sure that the IP they're choosing will meet their requirements, and to make sure that the chip or system being designed will meet the requirements of the end customer. Individual IP benchmarks tend to be more bare-metal focused whereas system benchmarks tend to run on top of the OS. There are always exceptions of course, but they tend to follow this pattern.
Fig 1. Executing the CoreMark benchmark on the ARM Cortex-A9 Performance Analysis Kit (CPAK)
As it turns out, benchmarks are quite a natural fit with 100% accurate virtual prototypes. Virtual environments with less than 100% accuracy are typically poorly suited because design optimizations made to simplify modeling requirements can have a drastic effect on benchmark results (much the same way that a compiler setting which may seem rather innocuous for regular software can dramatically impact benchmark performance). This makes accurate virtual prototypes a natural fit for bare-metal benchmarks such as CoreMark which deliver results in fewer than a million cycles. We've done quite a bit of work with the CoreMark benchmark with customers, and our A9 DMA DDR CPAK comes complete with the code changes needed to compile the CoreMark source, which is available from the benchmark's website.
As soon as the OS becomes needed for a benchmark to execute, the value of accuracy seems less intuitive to some users. After all, just booting Linux can take billions of cycles. Executing the system-level benchmarks can take millions or billions of cycles more. While you could do this on a 100% accurate virtual prototype, a bit of patience might be required. Thankfully, there are a few steps which can be taken to enable OS level benchmarks to be executed on an accurate virtual prototype.
OS Level Benchmarks
The first step is to reduce or even remove the time required to boot the OS. A few weeks ago, one of our applications engineers, Pareena Verma, blogged about the optimizations she had made to the Linux kernel to greatly reduce the number of cycles required to get up and running. You can either take these same steps yourself, or you can save yourself a lot of work and download the CPAK for Linux for the ARM Cortex-A9 or Cortex-A15, which have these optimizations already applied (and source is included as well, of course). You can then use Save & Restore to create a checkpoint right after the OS is booted and continue execution from there for future simulations. Alternatively, you can use Swap & Play to boot the system using ARM Fast Models and then continue running with 100% accuracy after any software breakpoint. The Swap & Play method is especially useful if you want to quickly get to a point of interest in a larger benchmark after it's been running for a bit. Swap & Play and Save & Restore will both enable you to run OS level benchmarks with 100% accuracy to guarantee correct results.
The Linux CPAKs I mentioned above support both Save & Restore and Swap & Play so you should be able to get up and running with little to no effort required. In fact, both Linux CPAKs also include binaries for the Dhrystone, Whetstone, LTP and LMbench benchmark programs to get you up and running more quickly. Of course, you can always add in your own benchmark as well using the steps in Pareena's first blog post.
Not All Cycles Are Created Equal
Following the steps above will certainly reduce the amount of time spent booting Linux and getting up and running, but what happens when it comes time to run the system level benchmarks themselves? Some of them can run for billions of cycles. I'm not known as a patient person and I'm certainly not going to wait around for a billion cycles. As it turns out though, the percentage of meaningful cycles in benchmark code tends to be quite small, as in single digits small. The trick of course is knowing which cycles to look at. A great paper on this topic was put out a few years back by some researchers from UCSD. It examines a number of benchmarks and determines that they consist of multiple execution phases. The benchmark results within any one of these phases tend to be consistent (in other words, any metric that I'm tracking tends to remain fairly constant during the execution phase). So, by identifying the phases in benchmark execution, recording the phase results and then also including the interphase results (transitions from phase to phase can be the most interesting), you can get a VERY close approximation to the results obtained by running the complete benchmark. As you can see in the graph to the right that I've extracted from the paper, across billions of cycles there are a number of phases in the various tracked metrics, but within any phase there tends to be little change. Using the techniques from this paper it's possible to run only 1% of the overall cycles of a system level benchmark and still extract the same results.
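The idea can be illustrated with a rough sketch. This is a deliberately simplified stand-in and not the method from the paper: it splits a trace of one per-interval metric (say, IPC sampled every so many cycles) into phases with a plain threshold test, then estimates the whole-run average from just one interval per phase, weighted by phase length. The function names, the 10% tolerance and the sample trace are all assumptions made up for this example.

```python
from statistics import mean


def find_phases(samples, tolerance=0.1):
    """Split per-interval metric samples into phases.

    A new phase starts when a sample differs from the running mean of the
    current phase by more than `tolerance` (relative to that mean).
    Returns a list of (start, end) index pairs, with `end` exclusive.
    """
    if not samples:
        return []
    phases = []
    start = 0
    total = samples[0]
    for i in range(1, len(samples)):
        current_mean = total / (i - start)
        if abs(samples[i] - current_mean) > tolerance * abs(current_mean):
            phases.append((start, i))  # close the current phase
            start = i
            total = samples[i]
        else:
            total += samples[i]
    phases.append((start, len(samples)))
    return phases


def estimate_from_phases(samples, phases, per_phase=1):
    """Estimate the whole-run mean from `per_phase` intervals of each phase.

    Returns (estimate, fraction_of_intervals_used).
    """
    weighted = 0.0
    used = 0
    for start, end in phases:
        picked = samples[start:min(start + per_phase, end)]
        weighted += mean(picked) * (end - start)  # weight by phase length
        used += len(picked)
    return weighted / len(samples), used / len(samples)


if __name__ == "__main__":
    # Hypothetical IPC trace: three flat phases of different lengths.
    trace = [1.0] * 50 + [0.5] * 30 + [1.2] * 20
    phases = find_phases(trace)
    estimate, fraction = estimate_from_phases(trace, phases)
    print(phases, estimate, mean(trace), fraction)
```

On a perfectly flat synthetic trace like this one the estimate matches the full-run mean while touching only a few percent of the intervals. Real traces are noisier, so the tolerance and the number of intervals sampled per phase would need tuning, and the transitions between phases deserve their own look.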
This data leads to a few possible approaches to streamlining system benchmark execution on 100% accurate virtual platforms:
1) Execute the complete system in a Fast Model-based virtual prototype to identify phases and then modify the code to reduce the amount of time spent in each phase (since time spent in each phase merely reinforces what you already know). Rerun the modified code with virtual prototype accuracy at 100% to obtain the benchmark results.
2) Execute the complete system in a Fast Model-based virtual prototype and create Swap & Play checkpoints before each phase transition. Then execute each phase in a separate cycle accurate virtual prototype.
I personally prefer the second approach since it enables all of the phases to execute in parallel in separate cycle accurate simulations, doesn't require any code modifications and will give 100% accurate results. Execution time will be dictated by the longest phase, however, so if any single phase dominates the run time then it may be wise to modify that code somewhat since, as we've seen in the paper, the results in any single phase don't vary much.
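A quick back-of-the-envelope sketch shows why the longest phase matters. It assumes one simulation per phase, all started at the same time and all running at the same speed, so wall-clock time tracks the cycle count of the longest phase. The helper names and the cycle counts are hypothetical.

```python
def parallel_speedup(phase_cycles):
    """Speedup of running every phase in its own simulation at once,
    compared with running all phases back to back in one simulation."""
    return sum(phase_cycles) / max(phase_cycles)


def dominant_phase(phase_cycles):
    """Return (index, share) of the phase that sets the parallel run time,
    where share is its fraction of the total cycle count."""
    longest = max(phase_cycles)
    return phase_cycles.index(longest), longest / sum(phase_cycles)


if __name__ == "__main__":
    # Hypothetical cycle counts (in millions) for three phases.
    cycles = [200, 500, 300]
    index, share = dominant_phase(cycles)
    print(f"speedup {parallel_speedup(cycles):.1f}x, "
          f"phase {index} holds {share:.0%} of the cycles")
```

If one phase holds most of the cycles, the speedup stays close to 1x no matter how many simulations you launch, which is the case where shortening that phase in the code starts to pay off.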
Regardless of which approach you choose, however, you can easily increase the overall speed of simulation by a few orders of magnitude without sacrificing accuracy. This makes the bit of extra work easily worth it.
Benchmarks aren't just for marketing anymore. They can be used very effectively to drive system optimizations at both the bare metal and OS level. Carbon CPAKs, which bundle 100% accurate models and ARM Fast Models together with OS and bare metal benchmarks, aim to get users up and running faster than any other way possible.