In the previous parts of this blog series, I discussed the procedure of Carbonizing the RTL of an ARM Mali-450 GPU and building an SoC Designer component, followed by the description of a virtual prototype environment that runs cycle-accurate integration tests provided by ARM on the Mali-450 GPU using a Cortex-A15 based system.
This part describes a Linux-based Carbon Performance Analysis Kit (CPAK) containing a Cortex-A15 based system with an ARM Mali-450 GPU. This CPAK enables the user to integrate the Mali GPU drivers into the Linux kernel and register the Mali driver while booting Linux on the virtual prototype environment. We’ll do this by combining the speed of ARM’s Fast Models with the accuracy of the Carbonized GPU model in the same system.
Partitioning the System
The block diagram in Figure 1 shows the Mali-450 CPAK that we’ve been working with so far. This system is composed entirely of 100% accurate models, which enable us to see the impact of our IP configuration settings and bare-metal software. As we migrate to higher level tasks such as driver development, we can map portions of the system to more abstract models in order to execute at faster speeds. SoC Designer automatically understands the relationship between Carbonized models of ARM IP and their Fast Model equivalents, so all we need to do is tell the Fast Model Generator tool in SoC Designer which models we would like to execute accurately and which ones should be represented as Fast Models.
The dotted line indicates the portion of the virtual prototype we’d like to execute as ARM Fast Models. The rest of the system will continue to execute as cycle accurate models. Once we’ve identified the models for each domain, SoCDesigner will automatically build a new virtual prototype representation containing both Fast Models and Carbonized, 100% accurate models. Figure 2 shows a block diagram that describes such a system. All of the necessary transactor logic to map between the abstract models and the accurate models is inserted automatically.
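To make the partitioning idea concrete, here is a minimal Python sketch of how the boundary between the two domains could be found. It is not how SoC Designer does it internally; the component names, the connection list and the `boundary_links` helper are all hypothetical and only illustrate that a transactor is needed wherever a link crosses from a Fast Model to an accurate model.

```python
# Hypothetical component names and links; not taken from the actual CPAK netlist.
CONNECTIONS = [
    ("cortex_a15", "interconnect"),
    ("interconnect", "memory"),
    ("interconnect", "uart"),
    ("interconnect", "mali450"),
]

# Components we choose to keep cycle accurate; everything else runs as a Fast Model.
ACCURATE = {"mali450"}


def domain(component, accurate):
    """Return the execution domain of a component."""
    return "accurate" if component in accurate else "fast"


def boundary_links(connections, accurate):
    """Return the links whose two endpoints sit in different domains.

    These are the places where a transactor (abstraction adapter) is needed.
    """
    return [
        (a, b)
        for a, b in connections
        if domain(a, accurate) != domain(b, accurate)
    ]


for a, b in boundary_links(CONNECTIONS, ACCURATE):
    print(f"transactor needed between {a} and {b}")
```

With only the GPU marked as accurate, the single boundary is the link between the interconnect and the GPU, which matches the intent of the dotted line in the figure.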
Using the CPAK with ARM Fast Models
Now that we’ve updated the CPAK to ARM Fast Models, let’s put it to use. Since the virtual prototype will run at ARM Fast Model speeds when the accurate models are not being used, we can now use this platform to quickly boot Linux and integrate the Mali device driver. In order to do this, I integrated the Mali Linux drivers into the Linux kernel that was used as part of the A15 Linux CPAK. The Linux drivers are part of the Mali-450 GPU Linux Driver Development Kit (DDK) provided by ARM. The DDK contains the Mali Linux device drivers and the base drivers. It also provides links to download the OpenGL drivers specific to the Mali-450 GPU. The Integration Guide was used, along with minor modifications specific to the virtual prototyping environment, to integrate the software into the Linux kernel. The integration procedure involved the following steps:
- Integrating and building the Mali Linux device driver: The user specifies the macros corresponding to the GPU configuration parameters, along with configuring the Mali GPU memory, the framebuffer memory and power management options. The device driver was built as part of the kernel image, but it can also be built as a kernel module.
- Building the OpenGL and GLES libraries and adding them to the root file system.
- Building the GPU benchmarks and adding them to the root file system.
- Building the kernel image that contains the Mali device drivers, base drivers, the OpenGL drivers and the GPU benchmarks.
This CPAK can be useful to device driver developers, who can step through the process of loading the Mali device drivers during boot-up. Figure 3 shows the messages printed on the console while loading the Mali Linux drivers with debug messages turned on. It describes the whole procedure of registering the drivers, which starts with initializing the Mali memory system and ends with initializing the platform device, provided all intermediate steps pass. The intermediate steps define the settings for the framebuffer and for the dedicated and shared memory. Each of the internal components of the Mali-450, including the L2 cache, the Graphics Processor, the Pixel Processors and the Dynamic Load Balancing unit, is then created and its base address is defined.
With the driver ported to Linux, it is now possible to quickly boot the OS in just over two minutes and then start processing frames with 100% accuracy. The benchmark that uses the Mali base drivers runs test suites on:
- GPU memory allocation
- Initialization of the Graphics Processor
- Initialization of the Pixel Processors
- Running vertex shader jobs
- Running rendering jobs that draw a simple triangle.
Getting the whole system up and running took me 3 weeks, but all of the work that I’ve done is incorporated into the Mali-450 CPAK, meaning that you can duplicate the same results within minutes of downloading the package. The package includes an app note that describes the steps needed to perform the integration using pre-built binaries.
Fall is always a busy time of year here in New England where Carbon is based. The students are back at the colleges and universities, leaf peepers clog up all the roads and for a brief period all four major pro sports teams are playing (Go Red Sox!). Fall is also Carbon's busy time for conferences. Although we don't typically exhibit at many tradeshows, we do try to have a presence at most of ARM's technical conferences. This flurry of activity starts with ARM® TechCon™ in Santa Clara and typically ends six weeks later with the European Technical Symposium in Paris. Although it makes for a large amount of travel, it's a great opportunity to speak with lots of ARM designers and programmers about the challenges they're facing.
Carbon will be present at most of these conferences (the full list will be at the end of the blog) and will be making presentations at most of them as well. At ARM TechCon, we'll be doing a joint presentation with ARM entitled "Getting the Most Out of the ARM CoreLink™ NIC-400." This presentation will go over some of the highlights of ARM's NIC-400 product and then discuss a two-step methodology to optimize this crucial piece of IP to best attain your design goals. After last year's ARM TechCon joint presentation with ARM, "High Performance or Cycle Accuracy? You Can Have Both!", we made the corresponding whitepaper and presentation available for download, and we will do that once again this year after the conference.
At the other conferences, we'll be presenting "Getting the Most Out of Advanced ARM IP", which will discuss methodologies to optimize some of ARM's other recently announced IP blocks such as the ARM Cortex™-A57, Cortex-A53, Cortex-A15, Mali and several others. While I don't want to give away the entire presentation here, you can be sure it will make some mention of implementation accuracy, performance optimization and executing pre-silicon firmware.
If you're attending any of these conferences, please come by the Carbon booth and say hi. We'd love the chance to talk with you about the latest work we've been doing to enable customers to optimize their ARM-based SoC designs and get the most out of advanced ARM IP.
Carbon will be participating in the following conferences:
- Santa Clara Convention Center, Santa Clara, CA
- Intercontinental Seoul Coex, Seoul, Korea
- Sheraton Hongqiao, Shanghai, China
- Sheraton Dongcheng, Beijing, China
- Ritz-Carlton Futian, Shenzhen, China
- ARM Technology Symposium, Tokyo Conference Center, Tokyo, Japan
- ARM Technology Symposium, CAP 15, Paris, France
As system on chip (SoC) designs have grown from 10s to 100s of millions of gates, designers have had to go to great lengths to deliver designs which are well differentiated from the competition. Whereas the majority of the content of previous generation SoCs may have been designed internally or created from scratch for a new generation, this is certainly not the case nowadays, as the vast majority of intellectual property (IP) blocks are reused from previous designs or, more likely, purchased from external sources. You only need to look at the fantastic market success enjoyed by IP companies such as ARM and Arteris to see how much the industry now relies upon third party IP to drive its systems. (The recent spate of purchases by Cadence® shows how much importance they see in this market segment.) As the third party content of the chip rises, however, it becomes increasingly difficult to differentiate your SoC design from others in the marketplace. If everyone is designing with an ARM® Cortex™-A15 processor, Arteris® FlexNoC™ interconnect, Cadence memory controller and Imagination Technologies® GPU, and you’re using the same IP, how can you differentiate your system design, especially if you’re all using the same fab?
As the leading provider of 100% accurate virtual IP models and systems (including models from ARM, Arteris, Cadence, Imagination Technologies and many more), Carbon has seen this exact problem play out time and time again throughout our customer base. Differentiation is achieved in many ways of course, but the most advanced approaches typically focus around a few key design areas (and I’ll confine my discussion to the front end of the design cycle here since it’s where you can make the most impact): configuration, integration, and power instrumentation.
IP configuration seems intuitively obvious, but the complex interactions between all of the various options in any single IP block can lead to a huge difference in the achievable performance. Tie that block in with a number of other, similarly configurable blocks and the possible options grow exponentially. Leading edge designers are taking a two-stage approach towards IP configuration: stand-alone and integrated. Stand-alone optimization is just what it sounds like. The IP block is placed in a very simple test system where the IP block is modeled along with traffic generators and receivers on all of the various relevant ports. This is intuitive of course because it’s how many blocks are verified. Instead of focusing on verification however, the emphasis is on the interplay of various configuration options. The IP options are exhaustively modeled and then subjected to representative system traffic. This approach can quickly point out which options will best meet your target specifications. It seems intuitively obvious but only a small number of IP teams were employing this approach for non-interconnect IP when we first started working with them. A recent design was able to double the performance on memory reads by applying these methods to a memory controller configuration that was already shipping in an earlier generation product.
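As a rough illustration of the stand-alone stage, the sketch below exhaustively sweeps a small option space against one traffic trace and ranks the results. Everything in it is an assumption made for the example: the option names, the traffic trace and especially the `avg_read_latency` function, which is an invented toy cost model standing in for a real cycle-accurate simulation run.

```python
import itertools

# Hypothetical configuration options for a memory-controller-like IP block.
OPTIONS = {
    "queue_depth": [4, 8, 16],
    "page_policy": ["open", "closed"],
    "reorder": [False, True],
}

# Hypothetical "representative traffic": the DRAM row touched by each read.
TRAFFIC = [0, 0, 1, 1, 1, 2, 0, 0]


def avg_read_latency(cfg, traffic):
    """Toy latency model in cycles; all constants are invented for illustration."""
    total = 0
    open_row = None
    for row in traffic:
        hit = cfg["page_policy"] == "open" and row == open_row
        total += 10 if hit else 30          # row hit vs. row miss cost
        open_row = row if cfg["page_policy"] == "open" else None
    latency = total / len(traffic)
    if cfg["reorder"]:
        latency *= 0.9                      # assumed benefit of reordering
    latency += 16 / cfg["queue_depth"]      # shallower queue, more stalls
    return latency


def sweep(options, traffic):
    """Evaluate every combination of options and return results, best first."""
    names = list(options)
    results = []
    for values in itertools.product(*(options[name] for name in names)):
        cfg = dict(zip(names, values))
        results.append((avg_read_latency(cfg, traffic), cfg))
    return sorted(results, key=lambda result: result[0])


best_latency, best_cfg = sweep(OPTIONS, TRAFFIC)[0]
print(best_cfg, round(best_latency, 2))
```

Even this tiny space has 12 combinations, and each added option multiplies the count. That growth is why the sweep is worth automating rather than exploring by hand.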
In the interconnect world, the story is a bit different, as it’s quite common to construct a virtual platform containing traffic generators combined with real stimulus in order to try out various configuration settings. Carbon has partnered with Arteris to make pre-built virtual prototypes (called Carbon Performance Analysis Kits™ or CPAKs™) available for exactly this purpose. These CPAKs are available containing various ARM processors (including the ARM Cortex-A15, Cortex-A7 and Cortex-A9 with more to come) as well as multiple traffic generators, all tied together either with generic interconnect components or IP from ARM or Arteris. Bare-metal and OS software is bundled as well to configure and drive the system. It’s an ideal starting point for quick optimization or customization to more closely reflect your own design.
Screenshot of a CPAK featuring a dual-core ARM Cortex-A9 together with an Arteris FlexNoC and multiple configurable traffic generators
Achieving the best system performance is not possible on a per-component basis; performance must be looked at within the context of a system. Traffic generators can give a first approximation of real traffic, but the only true way to validate the performance characteristics of a system is to pull it together and have it execute real software. Only then can the true impact of various configuration settings be seen and problems uncovered. The interaction between high priority arbiters and high bandwidth components seems to be an especially problematic one, as the performance is tweaked to make sure that components aren’t starved and power isn’t wasted (overdesign is just as bad a problem as under-design when you’re on the bleeding edge of mobile devices).
The impact of software here cannot be overstated as it is ultimately the system software that determines how well the overall system will function and how much power it will burn. Ensuring that the software can correctly interact with the hardware to achieve the desired performance is a key milestone and leading edge designers subject our virtual prototypes to billions of bare-metal and OS-driven cycles running benchmarks of various flavors to not only stimulate the systems in various software-driven ways but also validate that the product being built can live up to the marketing claims they’ll be making. Have you noticed that certain companies seem to always produce the fastest phone chips? They’re the ones that don’t look at speed optimization as a hardware problem or a software problem but rather an integrated system problem.
Here's a whitepaper published a few years ago by Samsung discussing how they used a cycle accurate virtual prototype to optimize the performance of their software even after the silicon had already been fabricated. Ideally you do this step earlier in the design so you can impact the hardware decisions as well, but the results which Samsung obtained are still impressive.
Software Driven Power Optimization
Hold it. Power? Didn’t we promise to focus on front-end issues? Although traditionally relegated to the back end of the design cycle, power decisions are being moved forward in the design cycle and many leading edge SoCs today have a concerted effort to measure the consumption of their devices while it’s still straightforward to make design decisions to reduce the power consumed by the system. This is typically done by executing system software on an accurate virtual prototype and then tracking the various power metrics. This can be implemented using a straightforward approach such as dumping waveforms while executing system software and then analyzing these results in your favorite EDA power tool. Software-based power vectors give a much better indication of how the device will actually perform and enable much more meaningful power decisions than are possible with vectors derived from an RTL testbench.
More sophisticated customers have adopted an instrumentation flow to enable power analysis. Instead of dumping waveforms during execution and then running these through a power analysis tool, they do a preliminary step of creating power numbers which correspond to the various power states of each model and then instrument the model using callbacks to dynamically track these states throughout the system. This enables runtime power analysis to be done without requiring waveforms and third party power tools. It also means that the system itself runs substantially faster, which is always a big benefit. We’ve published a whitepaper which details the implementation steps required for both approaches.
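The instrumentation flow can be sketched in a few lines of Python. This is only an illustration of the idea, not Carbon's callback API: the `PowerTracker` class, its `on_state_change` callback and the per-state power numbers are all hypothetical. Each model reports its power state changes, and the tracker multiplies the time spent in each state by that state's pre-characterized power number.

```python
from collections import defaultdict

# Hypothetical per-state power numbers in milliwatts. In a real flow these would
# be characterized up front for each model; the values here are invented.
POWER_MW = {
    "gpu": {"off": 0.0, "idle": 20.0, "active": 400.0},
    "cpu": {"off": 0.0, "idle": 50.0, "active": 800.0},
}


class PowerTracker:
    """Accumulates energy per component from power-state-change callbacks."""

    def __init__(self, power_table, initial_state="off"):
        self.power_table = power_table
        self.state = {name: initial_state for name in power_table}
        self.last_change_ns = {name: 0 for name in power_table}
        self.energy_pj = defaultdict(float)

    def _charge(self, component, now_ns):
        """Bill the time spent in the current state up to now_ns."""
        elapsed_ns = now_ns - self.last_change_ns[component]
        power_mw = self.power_table[component][self.state[component]]
        # 1 mW for 1 ns is 1 picojoule, so no unit conversion is needed.
        self.energy_pj[component] += power_mw * elapsed_ns
        self.last_change_ns[component] = now_ns

    def on_state_change(self, component, new_state, now_ns):
        """Callback a model would invoke whenever its power state changes."""
        self._charge(component, now_ns)
        self.state[component] = new_state

    def finish(self, now_ns):
        """Close out all components at the end of the run and return totals."""
        for component in self.power_table:
            self._charge(component, now_ns)
        return dict(self.energy_pj)


tracker = PowerTracker(POWER_MW)
tracker.on_state_change("cpu", "active", 0)
tracker.on_state_change("gpu", "idle", 0)
tracker.on_state_change("gpu", "active", 1000)   # GPU starts a job
tracker.on_state_change("gpu", "idle", 3000)     # GPU job done
energy = tracker.finish(4000)
print(energy)
```

Because only state transitions are recorded, no waveforms need to be dumped, which is where the speed advantage of this flow comes from.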
Evolving design techniques and an increased reliance upon third party IP have eliminated many of the approaches that designers have used to differentiate their products and distance themselves from the competition. When one door closes, however, another opens, and the design areas discussed here are being used today by leading-edge design teams to attack this space and create cost-effective, differentiated products. Adopting these approaches will enable your designs to keep pace and stand out in the crowd.
Note, this post was adapted from an article we included in the most recent newsletter from Arteris.
Much has been written about the ways to accelerate an SoC design schedule. If you added up all the marketing claims made by the various EDA companies on time to market savings, you’d end up being able to ship your advanced SoC months before you even conceive of the idea. We’ve been working on a series of blogs lately focused around SoC design issues and questions that were laid out in the Asking the Tough Questions blog a few weeks ago. So far we’ve talked about choosing IP, configuring it correctly and optimizing your memory subsystem. Today we’re introducing software into the equation. Instead of talking about how this can pull in your overall design schedule though (since that gets written about all the time) I’d like to focus on how software integration can be a true product differentiator. Driving your SoC under development with system level software doesn’t just get things done more quickly. If done properly, you can use your software to drive the validation and optimization of the SoC being designed and create a more differentiated system-level offering.
Assembling the System
The first step in the process is to assemble the system that you’ll be optimizing. This can be a stumbling block since, very often, the people optimizing the system don’t have the intimate design and modeling knowledge to prototype it. This leads to a few possibilities: bite the bullet and learn about all the IP and the steps involved to tie it together and configure it; don’t worry about the lower level details of the system and just abstract them away; or take an existing system model from someone else and customize it to meet your needs. Since we’ll be talking about optimizing a system early in the design process, we’ll confine the discussion to virtual prototypes, since physical prototypes and emulators are available too late in the design cycle to enable system level optimizations.
There are pluses to all of these approaches of course. Learning about all of the IP being used in the system and then assembling a virtual prototype has a definite allure. After all, it’s good to know about the blocks in your system and everyone likes learning new things. Of course, there’s a downside to this as well. By the time you know enough about the complexity of each block in order to tie a system model together (and possibly needing to create virtual models for some IP), the design window is likely to have passed.
Abstracting away all the lower level implementation details has a great deal of appeal and it’s how a lot of virtual models are created. This approach is great for high level software development since a properly created model will be functionally equivalent to the actual IP, just without the timing details. This enables the model to run fast and also be developed quickly without knowing a lot of the lower level details about the IP block. The downside, of course, is that these lower level timing details are really needed if we want to optimize system level performance. You can try to inject some level of timing detail to mimic the performance of an actual IP block, but this quickly leads to approximations built upon approximations built upon a model which never attempted to model timing accuracy in the first place. If you really need to see the performance impact of code running on a multi-issue, out-of-order, dual pipelined, multicore processor, are you going to trust timing approximations which you insert into a high level model which doesn’t even model the pipeline, let alone all of the speculatively executed code needed to fill those pipelines? In the end, a more accurate model is needed if we’re truly going to optimize system level performance.
If hardware prototypes are available too late, creating the virtual models yourself takes too long and using high level virtual prototypes leads to incorrect answers, how can you quickly create a prototype which is accurate enough for system level optimization?
Speed and Accuracy
This, of course, is where Carbon enters the picture. SoCDesigner Plus enables you to combine 100% accurate models from leading IP vendors such as ARM, Imagination Technologies, Cadence, Arteris and others which are available from Carbon’s IP Exchange web portal. These can be combined with SystemC models, user RTL models compiled using Carbon Model Studio and also ARM’s Fast Models. Support for ARM Fast Models isn’t unique to Carbon of course; every vendor offers this. Only Carbon, however, offers the capability to mix these fast, functional models with their 100% accurate equivalents and boot Linux or Android in seconds and then debug software or profile performance with 100% accuracy. We’ve talked about this before a few times (here and here, for example), so instead of talking about the feature itself, let’s dig a bit deeper and talk about how this feature can be used to differentiate your SoC design.
Figure 1: Profiling the Linux boot process on a Cortex-A15 CPAK
A System Approach
The key to optimizing system performance is to address it as a system problem, not a hardware or software issue. Teams spend huge amounts of effort optimizing the performance of the hardware subsystem and an equally large amount of time enhancing the speed of the software but it’s far too rare that these two tasks are performed concurrently. This means that there are missed opportunities for additional levels of optimization.
Having a virtual prototype which is both fast and 100% accurate enables this system level optimization to take place. The typical first step is to port one of the industry standard benchmarks to the platform. This is always a valuable step since the benchmark will be run on the finished silicon to provide marketing positioning. Why not execute the same benchmark during the development process to identify areas for optimization?
The initial port of software benchmark code can easily be done on a high level representation of the system. In our case, this means running with ARM Fast Model representations of the processor(s) and related IP instead of their Carbonized equivalents. This fast platform is great for functional code porting but doesn’t reveal much about system performance. Therefore, once the code is ported and up and running, it’s time to start running accurately to really start getting the value from running the benchmark. Swap & Play is great for this task since you can run quickly up to the point of interest, just before the benchmark starts for example, in the Fast Model representation and then switch over to the 100% accurate representation of the system to get correct results. You can then use these results to optimize the interaction of the various components of the system. SoCDesigner’s performance analysis visualization tools enable you to quickly identify performance bottlenecks and how they correspond to other activity in the hardware or software.
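The hand-over between the two representations can be pictured with a toy Python sketch. It is not the Swap & Play implementation; the `ArchState`, `FastModel` and `AccurateModel` classes and the cycle costs are invented for illustration. The point it shows is that both representations operate on the same architectural state, so the fast one can run up to the point of interest and the accurate one can take over from there and report timing.

```python
from dataclasses import dataclass


@dataclass
class ArchState:
    """Architectural state shared by both representations (toy: just a PC)."""
    pc: int = 0


class FastModel:
    """Functional only: advances the state but reports no timing."""

    def run_until(self, state, stop_pc):
        state.pc = max(state.pc, stop_pc)


class AccurateModel:
    """Steps one instruction at a time and counts cycles (invented cost model)."""

    def __init__(self):
        self.cycles = 0

    def run_until(self, state, stop_pc):
        while state.pc < stop_pc:
            # Assumed cost: every 4th instruction takes a 10-cycle stall.
            self.cycles += 10 if state.pc % 4 == 0 else 1
            state.pc += 1


def swap_and_run(boot_end_pc, benchmark_end_pc):
    """Run fast through boot, then hand the same state to the accurate model."""
    state = ArchState()
    FastModel().run_until(state, boot_end_pc)
    accurate = AccurateModel()
    accurate.run_until(state, benchmark_end_pc)
    return state.pc, accurate.cycles


final_pc, benchmark_cycles = swap_and_run(boot_end_pc=1000, benchmark_end_pc=1008)
print(final_pc, benchmark_cycles)
```

Only the eight "benchmark" instructions are simulated with timing here; the thousand "boot" instructions cost nothing in accurate simulation time.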
System software can be similarly optimized. When executing software on the accurate representation of the system it is straightforward to track the performance characteristics of both the hardware and software using SoCDesigner’s visualization tools. Now it is quite simple to draw correlations between software routines and their impact on overall system performance. This is obviously of great importance as system complexity increases. We’ve blogged previously about how Samsung used this approach with great success to optimize the software performance of a hybrid disk controller after silicon was already delivered. The whitepaper discussing this is available as well. This same approach is even more valuable when the hardware is still in a state where it can also be optimized. After all, system optimizations are typically most effective when they’re based upon both hardware and software features.
The approaches above are just scratching the surface of the optimizations that are possible when running real software on real hardware in advance of silicon. We didn’t even touch on the verification value which can also be obtained, and there are numerous additional optimizations which can be uncovered as well. We’ll leave those for a future blog.
Bill’s last blog summed up the tough questions SoC architects face at various phases of the design process and Andy’s last blog provided a great description of how Carbon customers tackle the many questions that arise during the IP selection process. In this blog, I will move on to the interconnect and the questions involved with optimization to allow for a balanced SoC: For performance critical paths, how can the bandwidth be maximized while minimizing the average latency? How can less performance sensitive paths be managed without disrupting higher priority traffic? How can the fabric strike the appropriate balance between throughput and latency for all paths through the bus? What is the best way to isolate and eliminate performance bottlenecks? How will cache coherency impact the interconnect traffic – and system throughput?
Since this topic is a lot deeper than can be explored in a blog posting, you can get a much more in depth analysis by downloading our whitepaper on interconnect optimization.
Chicken or Egg?
The SoC architect faces a unique challenge in selecting and optimizing the system bus. Usually the CPU core(s) and possibly the GPU are selected prior to the interconnect, but other IP blocks including the main memory controller, DMAs and general purpose peripherals are yet to be selected. Thus the interconnect has to be designed and optimized with an incomplete understanding of the actual workloads that it will need to arbitrate and load balance, as well as uncertainty about the latency and throughput constraints that the actual slave devices will impose. It is no surprise that the temptation to overdesign a low latency mesh is so strong, as Bill noted.
The paradox is that the key design decisions in optimizing the fabric require an understanding of the final system. Consider the performance sensitive path between the CPU and main memory. The interconnect must harmonize with the main memory controller configuration and programming such that the memory bandwidth is maximized while the latency through the bus is minimized. This design tradeoff is fundamental and must be accurately profiled, characterized, and displayed for the SoC architect.
The relationship between latency & throughput visualized in SoC Designer
A comprehensive I/O prioritization scheme must be established such that the fabric can provide flow control and arbitration to deliver appropriate QoS for the application. Moreover, real world dynamics such as transient shifts in traffic and head-of-line blocking can completely disrupt the designed flow control of the interconnect. Download the whitepaper for a more in depth look at the impact of slave device latency variations on traffic prioritization.
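To see why the arbitration scheme matters so much, consider this small Python sketch comparing two textbook arbiters under saturating traffic. It does not model any particular interconnect IP; the `fixed_priority` function, the `RoundRobin` class and the request pattern are hypothetical and exist only to show how a fixed-priority scheme can starve a lower priority master.

```python
def fixed_priority(requests):
    """Grant the lowest-numbered requesting master (master 0 has top priority)."""
    for master, wants_bus in enumerate(requests):
        if wants_bus:
            return master
    return None


class RoundRobin:
    """Grant requesting masters in rotating order, starting after the last winner."""

    def __init__(self, num_masters):
        self.num_masters = num_masters
        self.next_master = 0

    def __call__(self, requests):
        for offset in range(self.num_masters):
            master = (self.next_master + offset) % self.num_masters
            if requests[master]:
                self.next_master = (master + 1) % self.num_masters
                return master
        return None


def simulate(cycles, request_pattern, arbiter):
    """Count grants per master; request_pattern(cycle) returns a list of bools."""
    grants = {}
    for cycle in range(cycles):
        winner = arbiter(request_pattern(cycle))
        if winner is not None:
            grants[winner] = grants.get(winner, 0) + 1
    return grants


def both_always_request(cycle):
    """Hypothetical worst case: both masters request the bus on every cycle."""
    return [True, True]


print("fixed priority:", simulate(100, both_always_request, fixed_priority))
print("round robin:   ", simulate(100, both_always_request, RoundRobin(2)))
```

Under fixed priority, master 1 never gets a grant, while round robin splits the bus evenly. Real QoS schemes sit between these two extremes, which is why they need to be profiled against realistic traffic.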
Carbon customers have found effective ways of resolving this “chicken or egg” design dilemma of configuring interconnect to meet system performance objectives while providing robust QoS early in the design process, when the exact I/O workloads and target constraints are unknown.
A Two-Phased Approach:
The first phase involves configuring and building a virtual prototype to quickly and easily isolate performance bottlenecks.
Phase 1 – Architectural Exploration
- 100% cycle accurate model of interconnect available from Carbon IP Exchange
- Carbon AXI traffic generators
- Flexible memory sub-system models
Allows fast cursory optimization:
- Broad traffic profiling and sensitivity analysis
- Transaction tracing and back-pressure identification
- “What if?” analysis
Architectural exploration platform with Arteris FlexNoC
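The kind of “what if?” sweep done in Phase 1 can be illustrated with a toy Python model of one traffic generator feeding one slave through a queue. None of this reflects the behavior of the Carbon AXI traffic generators; the `rejected_injections` function and its parameters are assumptions made for the example. It counts how often the generator finds the queue full, which is a simple stand-in for back-pressure.

```python
def rejected_injections(inject_period, service_period, queue_depth, cycles):
    """Count injection attempts that find the queue full (toy back-pressure metric).

    The generator tries to inject one transaction every inject_period cycles and
    the slave drains one transaction every service_period cycles.
    """
    occupancy = 0
    rejected = 0
    for cycle in range(cycles):
        # Slave side: drain one queued transaction on its service cycles.
        if occupancy and cycle % service_period == 0:
            occupancy -= 1
        # Generator side: attempt an injection on its injection cycles.
        if cycle % inject_period == 0:
            if occupancy < queue_depth:
                occupancy += 1
            else:
                rejected += 1
    return rejected


# Sensitivity sweep: vary the injection rate against a fixed, slower slave.
for period in (1, 2, 4):
    count = rejected_injections(period, service_period=2, queue_depth=4, cycles=20)
    print(f"inject every {period} cycle(s): {count} rejected")
```

In this toy setup back-pressure appears only when the generator injects faster than the slave drains, and the queue depth just delays the point at which it starts.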
As IP blocks are selected later in the design process, the virtual prototype can be reused and modified to incorporate 100% implementation accurate models in place of traffic generators:
Phase 2 – Real World Virtual Prototype
- 100% cycle accurate models of CPU & other IP
- Real world application & traffic
- HW & Software Interaction
Allows architectural validation:
- Accurate traffic profiling
- Incremental optimization as more IP blocks are selected or designed
- Cache coherency analysis
Arteris FlexNoC based Real World Virtual prototype
Design tradeoffs explored earlier in Phase 1 can be revisited and validated against actual multicore CPU traffic. The platform can be further modified as DMA IP blocks and memory controllers are selected.
Read more about this iterative two-phased approach to interconnect optimization in our white paper.
Coherency & Accuracy
Hardware based cache coherency in SoCs has introduced significant complexity to the interconnect optimization process. Artificial workloads from custom traffic generators are not well suited to replicating coherency operations alongside the application workloads with high fidelity to the coherent workloads from actual IP. Virtual prototypes, such as the dual ARM® Cortex™-A15 with coherent ARM CCI-400 interconnect, allow for 100% implementation accurate simulation of coherent multiprocessor CPU traffic:
Carbon A15 bare metal CPAK multi-processor reference platform
The system above, from the Cortex-A15 Carbon Performance Analysis Kit (CPAK) allows for complete visibility of ACE traffic workloads in a SoC Designer environment. The relationship between the hardware coherency handshaking events in the CCI-400 can be profiled and visualized alongside the bus traffic to get a deeper understanding of the correlation between hardware and ACE traffic.
CCI-400 Coherency event profiling in relation to A15 ACE interface transactions
Understanding the impact of coherency operations within the interconnect and to the overall system performance can help architects partition their designs across multiple networks on chip.
Download the white paper for an in-depth look into how cache coherency affects bus traffic and how the SoC Designer environment facilitates the analysis of this complex interaction.
It always seems that there are more questions than answers when embarking upon any new endeavor. System on chip (SoC) design is no exception to this rule. There are many approaches that you can use in an attempt to answer these questions. How much can you trust the answers that you get and how much of your future do you want to gamble on those answers? The best way to answer that may be by asking a lot more questions…
Assuming that you’re designing an SoC, a typical design process starts with IP selection. There is a lot of IP available from different vendors. How do you choose the IP that best meets your design needs from a price, performance and area perspective? Do you need an ARM® Cortex™-A15 or can you meet your design goals with a Cortex-A9 instead? Should you reuse the hardware codec from the previous design or replicate that functionality in software for this revision? It’s tempting of course to just choose the latest, fastest IP available but this approach has all sorts of potential issues. You may end up with a very overdesigned SoC. This design may blow away its performance targets but at the expense of much higher IP costs and potentially much higher power consumption.
Once you’ve chosen your IP blocks, either from external or internal sources, they need to be configured to play together efficiently to meet your design goals. Once again, it may be tempting to use a fully connected mesh to hook together all of your IP with wide, low-latency busses, but you’ll probably spend more power than you need and have unnecessarily fast speeds on some data transfers. How will cache coherency cycles figure into your bus utilization? The latest coherency extensions to the AMBA bus protocols are dramatically changing the shape of system bus traffic. How much will this impact overall system throughput? If the design is partitioned into multiple fabrics (and most advanced designs do indeed contain multiple fabrics or networks on chip), what will be the design impact of placing an IP block in one domain versus another?
The interconnect is of course only a part of the system performance equation. How does your memory controller or controllers factor into it? Do you have too much latency on your processor to memory datapath? Have you overdesigned the system in such a way that the data comes back quickly but you’re burning too much power? What is the impact of this on your backend layout? Does your interpretation of the arbitration priority of your memory controller match what will really happen on the bus when all of the components are tied together?
You can’t underestimate the impact of cache sizing and layout on your system performance as well, especially with the new AMBA Coherency extensions. A few cache sizing and configuration decisions can have a dramatic impact on system performance and resultant snoop traffic. How will your cache size impact system performance? How much will that impact change when you start running different software?
Choosing and configuring your IP is only the start of your system design problem. A system is far more than just hardware and there needs to be a way to get all the layers of the software stack up and running on the system as well, preferably well before silicon is taped out. This is simple enough for high level application software but what about your firmware, drivers, diagnostics and other software that actually depends upon the hardware functionality of your new system? How will you validate the impact of software on system level performance and power consumption?
There are typically substantially more engineers writing this software than there were hardware engineers to create and verify the hardware. How will they all be enabled to develop and debug their portion of the system design? When they uncover hardware problems (or at least problems that they blame on the hardware!) how will these problems be debugged? Will the problem be thrown over the wall several times as fingers are pointed or will there be a common debug mechanism to enable the hardware and software engineer to work together? Will that solution be affordable enough for everyone to use it or will it be a scarce, productivity-limiting resource with a signup sheet filled 24 hours a day?
Finally, what happens if you want to enable your end customers to be productive on your system design before silicon is produced? After all, they may need to write software for the system too and if they have to wait until after silicon to accurately develop this software you’ll need to wait longer to start making money on your design. If this whole process takes too long, the design may even miss its market window entirely.
These questions are not unique, nor are they new. They’ve been growing in importance however. As more and more chip designs migrate from custom ASICs to SoCs, the system-level design problems which used to be faced by just a few teams are now faced by almost everyone. There are a variety of solutions in the market to address these problems ranging from RTL simulation to high level virtual prototypes to all manner of hardware prototypes. Over the course of the next few weeks we’ll look into each of these areas in more detail, discuss the questions that our customers are asking and show how they’re getting answers that they can trust.
In Part 1 of the blog, I discussed the procedure of building a Carbon model and an SoC Designer Plus component from the RTL of an ARM® Mali™-450 GPU that recently became available in the market. In this blog, I will discuss a bare-metal system I built on SoC Designer Plus that consists of an ARM Cortex™-A15 processor, the interconnect subsystem, memory, a Mali-450 GPU model, an external interrupt controller and a UART display, as shown in the block diagram (which corresponds to the one in the Integration Manual provided along with the RTL). I will also discuss how easily I could use the test harness provided by ARM as a part of the Platform Integration Kit along with the RTL to run simulations that test the Mali-450 GPU SoC Designer Plus model.
The ARM Mali-450 GPU Bare-Metal System:
The system is built such that the Mali-450 GPU has two AXI master ports that are used to access shared memory through the interconnect. The Cortex-A15 processor programs the GPU by configuring the internal GPU registers through the interconnect subsystem and the APB3 interface. The GIC-400 is configured to generate interrupts corresponding to each processor internal to the Mali-450 GPU. The UART is used to display characters and in this case, it displays contents of internal GPU registers and test results.
The Mali-450 GPU model was configured to have four Pixel Processors (PPs), two L2 Caches, one Geometry Processor (GP), and no Power Management Unit (PMU). As discussed in my previous blog, this configuration was provided to a customer who had initially requested an SoC Designer Plus model as soon as the RTL was made available by ARM.
The Test Infrastructure:
Along with the RTL, ARM also provided an Integration Kit that consists of tests to confirm that the GPU is correctly integrated into the system. These tests confirm that the internal APB registers of various components in the GPU (like the PPs, the GP, the L2 caches, the DMA unit, the Broadcast unit, and the Dynamic Load Balancing Unit) are readable and writable. The results of the tests are displayed on the UART. Tests that confirm that the IRQ pins are connected correctly and that the GIC-400 is properly configured are also available as a part of the test infrastructure. The boot code corresponding to the ARM processor and the driver code corresponding to the Mali-450 GPU, the GIC-400 and the UART were provided with the Integration Kit.
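To make the idea of a register read/write check concrete, here is a minimal Python sketch of the technique. It is not ARM's integration test code and does not use the real Mali-450 register map: the RegisterBlock class, the block names, the register counts and the test patterns are all hypothetical stand-ins chosen for illustration.

```python
class RegisterBlock:
    """Toy stand-in for an APB-mapped register block (hypothetical, not the Mali map)."""

    def __init__(self, name, num_regs, read_only=()):
        self.name = name
        self.regs = [0] * num_regs
        self.read_only = set(read_only)  # indices that silently ignore writes

    def write(self, index, value):
        if index not in self.read_only:
            self.regs[index] = value & 0xFFFFFFFF  # registers are 32 bits wide

    def read(self, index):
        return self.regs[index]


def check_read_write(block, patterns=(0xFFFFFFFF, 0x00000000, 0xA5A5A5A5)):
    """Write each pattern to every register and confirm it reads back unchanged.

    Returns a list of (register index, pattern) pairs that failed.
    """
    failures = []
    for index in range(len(block.regs)):
        for pattern in patterns:
            block.write(index, pattern)
            if block.read(index) != pattern:
                failures.append((index, pattern))
    return failures


def report(block):
    """Format a one-line result, similar in spirit to what a test prints on a UART."""
    failures = check_read_write(block)
    status = "PASS" if not failures else f"FAIL ({len(failures)} mismatches)"
    return f"{block.name}: {status}"


if __name__ == "__main__":
    print(report(RegisterBlock("PP0", num_regs=4)))
    print(report(RegisterBlock("L2_0", num_regs=4, read_only={1})))
```

In a real integration test the reads and writes would go over the APB3 interface to the GPU rather than to a Python list, and registers that are read-only by design would be excluded from the write check.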
I started building the system on SoC Designer Plus by configuring the PL301 interconnect using ARM AMBA Designer with address ranges defined for the UART, GIC-400 Interrupt Controller, memory, and the GPU (these details were provided in the Integration Manual). Because the GPU required an address range of 192 KB and each APB3 slot on the PL301 provides only up to 4 KB, 48 APB3 slots were defined and eventually merged using the APBMerger component.
After having configured the PL301 interconnect, I only needed to put the whole system together on the SoC Designer Plus Canvas and run the test application. By just providing the test name in the Makefile, the application file (axf file) to be loaded into the SoC Designer Plus simulator can be built. These axf files were loaded and the test applications executed immediately without any manual intervention. I had heard from colleagues that example code provided by ARM is easily ported using SoC Designer Plus, but it was still a pleasant surprise to see how little effort I had to make to get the simulations to run correctly. I did have to make a few modifications to the driver code corresponding to the UART and GIC-400 since the Integration Manual referred to different versions of those peripherals, and the integration tests were devised for a different GPU configuration. Thanks to the excellent debugging and monitoring capabilities provided by SoC Designer Plus, it took me minimal effort to make those changes.
As easy as all of this was for me to do, you won't have to repeat the steps since my work is available for your use. The bare-metal system I put together on SoC Designer Plus is now available as a Carbon Performance Analysis Kit (CPAK) for customers. The integration tests described above are also bundled in and available as a part of the CPAK.
In part 3 of this blog, I will describe the procedure of patching the Mali GPU Linux driver on the Linux CPAK, and how simulations can be run on SoC Designer Plus where the GPU processes certain images and displays them on an LCD.
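The slot count follows directly from the two sizes: 192 KB divided by 4 KB per slot gives 48 slots. The short Python sketch below works through that arithmetic and lists the address window each slot would cover. The base address used here is a hypothetical placeholder, not the value from the Integration Manual, and the helper function is mine rather than anything provided by AMBA Designer.

```python
KB = 1024
GPU_RANGE_BYTES = 192 * KB   # GPU register space, from the text
APB_SLOT_BYTES = 4 * KB      # maximum size of one APB3 slot, from the text
GPU_BASE = 0x4000_0000       # hypothetical base address, for illustration only


def apb_slots(base, total_bytes, slot_bytes):
    """Return (start, end) address pairs for contiguous slots covering total_bytes."""
    count = -(-total_bytes // slot_bytes)  # ceiling division
    return [
        (base + i * slot_bytes, base + (i + 1) * slot_bytes - 1)
        for i in range(count)
    ]


slots = apb_slots(GPU_BASE, GPU_RANGE_BYTES, APB_SLOT_BYTES)

if __name__ == "__main__":
    print(len(slots))  # 48
    for start, end in slots[:3]:
        print(hex(start), hex(end))
```

Merging the 48 slots back into one APB3 interface, as the APBMerger component does, presents the GPU with a single contiguous 192 KB window.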
For those that have been following Carbon's Blog for a while, you may have noticed that many of the entries describe how customers have used Carbon technology to solve or identify a problem they were having. One example of such an entry is Eric's blog about DDRx Memory Controller Selection and Optimization. In this entry Eric talks about the benefits of using Carbon's DDRx memory solution and why this is a critical component in the virtual prototype. Another common thread has been around CPAKs and their ability to reduce the time customers spend in getting the initial prototype up and running. Pareena wrote an excellent entry about booting Linux on the ARM® Cortex™-A15.
What if you could combine the benefits talked about in these previous blog entries? What opportunity or opportunities would that open up to you? Today I will talk about how Carbon's solution is addressing this for our customers worldwide.
Advanced Performance Optimization
During this phase, customers typically want to run a set of benchmarks on top of an operating system. An example of this might be Dhrystone, CoreMark or tiobench on top of Linux. For those not familiar, tiobench is a multi-threaded I/O benchmark that is used to measure file system performance. All of these cases require a significant number of simulation cycles to complete. Unfortunately, many people conclude that this use case is not an option for a cycle accurate solution, so they either opt for cycle approximate models, which can lead to an inaccurate and un-optimized SoC, or they skip this critical optimization step. That conclusion could not be further from the truth. The good news is that you don't have to accept inaccuracy and you don't have to accept skipping this step if you use Carbon's Virtual Prototype solution.
The core pieces of Carbon technology that allow customers to do advanced performance optimization are our integration with ARM Fast Models and Carbon's Swap & Play. Our integration with the ARM Fast Models allows customers to get increased simulation performance in selected components during periods of time when accuracy isn't critical. Swap & Play dynamically allows the ARM Fast Model components to be swapped out in favor of cycle accurate components when accuracy is required, for example while the benchmark runs. Essentially this boils down to performance when you want it and accuracy when you need it. In the example system below, I started with the Cortex-A15 Linux CPAK. After booting Linux, I created the Swap & Play checkpoint corresponding to the start of the Dhrystone benchmark. If you have many different checkpoints, each representing a different benchmark or interesting point that you have created, this is not an issue. Managing all of them is simple, since SoC Designer Plus provides a checkpoint manager for Swap & Play to organize a user’s checkpoints.
After restoring the Swap & Play checkpoint I created in the cycle accurate system, I then complete the simulation running the benchmark. Below is a screen shot after turning on the profiling features available in the Cortex-A15. Let's pause for just a moment and think about what is actually being shown here. We are looking at actual HW events and statistics from a benchmark running on top of an OS with a virtual prototype. You no longer have to wait for an FPGA of the system to do this level of analysis.
Furthermore, without Swap & Play from Carbon and the accuracy of our solution, you will almost certainly either make incorrect architectural tradeoffs or end up with an un-optimized system. If you were to find this in an FPGA prototype, would you still have enough time in your project schedule to go back and re-validate and verify the architectural change? This means delays in your time to market and that costs you money.
Of course, one could always just over-engineer the solution, but this will lead to extra power being used and increased size. Power consumption isn’t just important in the mobile application processor market space. It is important in all market spaces. Wasted power means wasted money!
In the next few weeks you will be learning about advances Carbon has made in the Virtual Prototyping methodology with additional CPAKs and additional system simulation, profiling and characterization capabilities. To learn more about how Carbon's solutions can help you with advanced performance optimizations or to learn more about booting an operating system with a virtual prototype please click below.
Carbon exhibited last week at ARM® TechCon™ in Santa Clara. True to form, it was a very successful show for us and generated a lot of interest at our booth. I also gave a talk together with Rob Kaye from ARM about how to create an ARM virtual prototype which has both speed and accuracy. (You can view a copy of that presentation here or get it in whitepaper form.) The talk was quite well-attended and had roughly 60 attendees who had far more questions than we had time for at the end. (You can still ask them by the way. Post them as comments on this blog and I'll respond.)
Rob and I talked about the tradeoffs that seem to face all virtual prototype design teams: do you create a virtual prototype which runs fast and sacrifices cycle accuracy or do you create one which is cycle accurate but lacks the speed to develop software? The answer to the question is, of course, yes. If you do it right, you can create a single virtual prototype that is both fast and accurate.
The traditional approach to get a fast, accurate virtual prototype is to compromise a bit of speed and a bit of accuracy. This way you have a model which theoretically has the best attributes of both and none of the downsides. In reality, the approach that would seem to please everyone typically upsets everyone instead. It's too slow for use by software teams and too inaccurate for use by architects and firmware engineers.
This isn't to say that the approximately timed (AT) approach hasn't been tried for IP models. Look back four or five years in time and you would see AT models available from both ARM and MIPS. As model complexity grew, however, both companies abandoned the creation of AT models and instead began offering fast functional models (such as ARM's Fast Models) which make no attempt to model cycle accuracy but have enough functional accuracy to enable software binaries to run on them without modification. Both ARM and MIPS then partnered with Carbon Design Systems to offer cycle accurate models of their processor IP. If you take a look at the ARM IP section on Carbon's IP Exchange web portal, you'll see that we offer cycle accurate models of all of ARM's currently available processor models and also provide demo copies of the corresponding ARM Fast Models when they're available. (The recently announced Cortex-A57 and Cortex-A53 aren't there yet but availability will be announced after RTL is available from ARM.)
Since Carbon obviously works closely with ARM during the creation of our IP models, we spend a good deal of effort to make our cycle accurate models interchangeable with the corresponding Fast Models. This integration works well enough that you only need to create your virtual prototype once in SoC Designer Plus regardless of the level of abstraction at which the prototype will run. SoC Designer Plus understands the mapping between Carbon's accurate models of ARM IP and the corresponding ARM Fast Models and will automatically create the fast system representation from the accurate one. This way, you're not duplicating design and validation efforts creating separate virtual prototypes. Create the system once and the unified virtual prototype can then be used by software engineers and architects alike.
Having a single virtual prototype which can run at different speeds and accuracy levels is a great feature but SoCDesigner Plus can take this one step further and enable the virtual prototype to begin running based upon Fast Models and then switch over to the 100% accurate representation at any breakpoint. Now a user doesn't need to wait hours (or possibly even days) for a cycle accurate virtual prototype to boot Linux or Android and get to a point of interest to start debugging or system analysis. Instead, you can get to the same point in seconds using Fast Models and then switch over to a 100% accurate representation to continue execution. This technology, which we call Swap & Play, doesn't have to just create a single checkpoint either. You can create multiple checkpoints to enable separate problems to be analyzed or different drivers developed.
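The benefit of switching at a checkpoint can be reasoned about with a very simple wall-clock model: time spent before the swap runs at Fast Model speed, and only the region of interest runs at cycle accurate speed. The Python sketch below is that back-of-the-envelope model and nothing more. The function, its parameter names and the numbers in the example are all hypothetical placeholders picked to make the arithmetic visible. They are not measurements of SoC Designer Plus or of any ARM model.

```python
def wall_clock_seconds(boot_work, benchmark_work, fast_rate, accurate_rate, swap=True):
    """Estimate simulation wall-clock time in seconds.

    boot_work, benchmark_work: amount of simulated work (for example, cycles)
    fast_rate, accurate_rate:  simulated work completed per wall-clock second
    swap: if True, boot runs on the fast models and only the benchmark
          runs on the cycle accurate models; if False, everything runs
          on the cycle accurate models.
    """
    if swap:
        return boot_work / fast_rate + benchmark_work / accurate_rate
    return (boot_work + benchmark_work) / accurate_rate


if __name__ == "__main__":
    # Placeholder numbers, not measurements.
    params = dict(boot_work=1_000_000, benchmark_work=10_000,
                  fast_rate=100_000, accurate_rate=100)
    print(wall_clock_seconds(**params, swap=True))
    print(wall_clock_seconds(**params, swap=False))
```

The model makes one point: when the boot phase dominates the total work, almost all of the simulation time is saved, while the region you actually care about still runs at full accuracy.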
This is not a new development. Carbon has been shipping Swap & Play for a while now. I've blogged about it before (here and here), published an article about it in ARM IQ, and our own Carbon Performance Analysis Kits (CPAKs) for the Cortex-A9, Cortex-A15 and Cortex-A7 all contain support for it (as will CPAKs for future ARM processors). This Carbon-exclusive technology has enabled our customers to solve problems with virtual prototypes that would previously have required expensive hardware prototype solutions. Carbon's unique Swap & Play technology is enabling users to have speed when they want it and accuracy when they need it.
Earlier this year I blogged about DAC and the decision process we go through every year before deciding to attend. After all, it’s tough finding engineers focused on system level design issues in a conference primarily populated by people looking to solve backend design problems in hardware. Based upon that, you’d think that embedded shows such as Design East and West (formerly the Embedded Systems Conference) would be a better fit. Heck, they even have, or at least had, the word “systems” in the conference title. Just as DAC was too hardware focused, however, the Design East and West conferences go too far the other way for us and are centered almost entirely on software. This would be fine, but that software is typically targeted at existing devices and microcontrollers, not at new system on chip designs. We’ve exhibited there a few times but have never had much success connecting with new potential users. It seems that the other folks offering virtual prototype focused solutions have come to similar conclusions since they have similarly stopped exhibiting.
Between the backend hardware focus of DAC and the microcontroller software focus of Design East and West, however, there is a “just right” conference for us, or in this case, set of conferences. (Those of you who grew up with the story of Goldilocks and the Three Bears hopefully recognize the analogy by now. Sure, I had to reach a bit for it but what can I say, I’m an engineer, not a writer.) The just right conference for us is ARM TechCon and the follow-on series of ARM Technical Symposia. There’s a good balance of emphasis on both the hardware and software complexities being faced.
ARM has even taken the additional step of breaking TechCon down into two sub-conferences: one day focused on system on chip design and two days centered on software. As you’ve hopefully noticed by now, Carbon has a number of solutions designed to accelerate ARM system on chip designs so this makes it the ideal venue for us to meet with our current and future customers. We’re even jointly presenting with ARM during the first day technical sessions at ARM TechCon (High Performance or Cycle Accuracy? You Can Have Both!) on Tuesday, October 30th in room 203.
We'll be exhibiting our virtual prototype solutions for ARM Cortex-A15, Cortex-A7, big.LITTLE and the rest of the ARM IP portfolio. Our recently announced Carbon Performance Analysis Kits (CPAKs) have gotten a great reception from our customers so I'm looking forward to talking about them in person with even more people.
We’ll be following up our participation at TechCon with a presence at the following ARM Technical Symposia:
November 20th – Seoul, Korea
November 22nd – Taipei, Taiwan
November 23rd – Hsinchu, Taiwan
November 26th – Shanghai, China
November 28th – Beijing, China
November 30th – Shenzhen, China
December 6th – Tokyo, Japan
December 13th – Paris, France
We won’t be presenting at the symposia but we’ll have a booth in each city. If you’re at the conference, drop by and say hi. We’d love to talk with you about what we’re doing at Carbon and hopefully how we can be the just right solution for your ARM-based SoC project.