Tips, Ideas, Discussion for SOC Virtual Prototypes


Running the Latest Linux Kernel on a Minimal ARM Cortex-A15 System

Linux® has become a popular operating system in embedded products, and as a result is experiencing increased usage for system design and verification. This means that a new set of people, who are not traditional embedded software engineers, need to learn to work with Linux to accomplish daily tasks.

For system design tasks, Carbon users generally proceed from bare-metal software benchmarks such as Dhrystone and CoreMark to running benchmark applications on an operating system. If the design uses ARM® Cortex™-A series processors, the most common OS choice is Linux. Linux has come a long way with respect to the ARM architecture since Linus Torvalds made his famous complaint about ARM support back in March 2011. The Linux Device Tree has made it very easy for those of us who work with simulated hardware, and often partial systems, to run Linux with almost no changes to the kernel source code.

As I mentioned in my previous blog, I recently joined Carbon.  I decided the best way to learn would be to use the tools the way a customer would.  I’ll talk below about the steps I took to get Linux running on a minimal system using the ARM Cortex-A15 processor.  If you’re using a Cortex-A15, you’ll get even more benefit as this platform is available for download as a Cortex-A15 CPAK.

Why a Minimal System?

It’s easier to start with a small system and iteratively add complexity: add peripherals, enable drivers, and verify at each phase that the system functions as expected. Starting off with a larger system introduces the difficulty of extracting hardware; it can be tricky to identify the software that relies on the hardware you want to remove.

It has also been my observation that many engineers designing and using leading edge SoCs are keenly interested in the performance details of key parts of the chip such as the processors, interconnect, memory controller, and GPU. These processor sub-system architects don’t always want or need a lot of the slower peripherals that are typically assumed to be present in systems running Linux.

Initial Challenges

The primary goal is to identify the minimum hardware needed to run Linux on an ARM Cortex-A15. It should be possible to run Linux with nothing more than the A15 CPU, memory, and a UART, so I set out to try it.

A secondary goal is to use the latest kernel release and change as little of the Linux source code as possible. Minimizing source code changes makes it easier to update to new versions of Linux as they are released.


Simulation speed is far more important than hardware accuracy when experimenting with Linux configurations to confirm a working kernel. This is an ideal situation in which to generate an ARM Fast Model design from the Carbon SoC Designer Plus canvas. After the kernel is working with the Fast Model design, it is easy to run the Carbon cycle accurate simulator, sdsim, for benchmarking and utilize Swap & Play to confirm a fully working virtual prototype that is both fast and accurate.

To meet the goal of changing as little kernel source code as possible, I started from a currently-supported platform, the ARM Versatile™ Express, then configured the kernel to use only the minimal hardware. New kernel versions will continue to support Versatile Express, making future upgrades easy to do.

Hardware Design

I decided to call my new hardware design the a15mini to indicate a Cortex-A15 system with minimal hardware. The Versatile Express design requires specifying the memory map and interrupt connections. The memory for the Cortex-A series Versatile Express is from 0x80000000 to 0xffffffff. The first PL011 UART is located at 0x1c090000 and uses interrupt 5 on the A15 IRQS[n:0] input request lines connected to the internal Generic Interrupt Controller (GIC). The Cortex-A15 needs to have the base address for the internal memory mapped peripherals (PERIPHBASE) set to 0x2c000000. The only other relevant information is that the UART runs from a 24 MHz reference clock.
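The numbers in this paragraph are easy to get wrong when wiring up a design, so a small sanity check can help. The following is a minimal sketch in Python, not part of the CPAK: the region sizes for the UART and the PERIPHBASE window are assumptions, and all names are made up for illustration. It relies on the GIC convention that shared peripheral interrupts start at interrupt ID 32, so input 5 on IRQS[n:0] appears to software as ID 37.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """One address range in the a15mini memory map."""
    name: str
    base: int
    size: int

    @property
    def end(self) -> int:
        """Last byte address covered by the region (inclusive)."""
        return self.base + self.size - 1


# Values taken from the text; sizes marked "assumption" are not from the article.
DRAM = Region("dram", 0x8000_0000, 0x8000_0000)          # 0x80000000-0xffffffff
UART0 = Region("pl011_uart0", 0x1C09_0000, 0x1000)       # 4 KiB window: assumption
PERIPH = Region("a15_periphbase", 0x2C00_0000, 0x8000)   # 32 KiB window: assumption

UART0_IRQS_INDEX = 5             # input on the A15 IRQS[n:0] lines
GIC_FIRST_SPI_ID = 32            # IDs 0-31 are reserved for SGIs and PPIs
UART_REF_CLOCK_HZ = 24_000_000   # 24 MHz UART reference clock


def gic_interrupt_id(irqs_index: int) -> int:
    """Map an IRQS[n:0] input index to the GIC interrupt ID software sees."""
    return GIC_FIRST_SPI_ID + irqs_index


def check_no_overlap(regions):
    """Raise ValueError if any two regions share an address."""
    ordered = sorted(regions, key=lambda r: r.base)
    for lower, upper in zip(ordered, ordered[1:]):
        if lower.end >= upper.base:
            raise ValueError(f"{lower.name} overlaps {upper.name}")


if __name__ == "__main__":
    check_no_overlap([DRAM, UART0, PERIPH])
    for region in (UART0, PERIPH, DRAM):
        print(f"{region.name:16s} 0x{region.base:08x}-0x{region.end:08x}")
    print("UART0 GIC interrupt ID:", gic_interrupt_id(UART0_IRQS_INDEX))
```

A check like this is only as good as the numbers typed into it, but it catches the most common slip, a region accidentally placed on top of another.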

Creating the a15mini with SoC Designer consists of instantiating the models and connecting them on the canvas using sdcanvas. Using cycle accurate models means more detail is needed to create the design. Instead of just the CPU, a simple address decoder, and memory, I used the ARM CCI-400 Cache Coherent Interconnect and the NIC-301 interconnect. These models, along with the Cortex-A15, are built using Carbon IP Exchange, the web-based portal that builds models directly from ARM RTL code. It’s pretty amazing to think that I answer a few questions or submit an XML file from AMBA Designer and get back a simple-to-use model in the form of a .so file, knowing that it was generated from a very complex RTL design of a processor like the Cortex-A15. It’s as if I’m using millions of lines of Verilog code and I never see any of it.

The design is shown below:

 ARM Cortex-A15 Minimal Linux CPAK Screenshot

 Linux Preparation

Creating a Linux image for the a15mini is a little more complex than the hardware design procedure.  

There are many ways to prepare Linux, but at a minimum the following items are needed:

  • Boot Loader
  • Kernel Image
  • Device Tree Blob
  • File System

For this experiment I decided to make things as easy to work with as possible. The most straightforward way was to use a single executable file (ELF file) containing all of the above items; anybody who wants to run the platform needs only this one file to represent all of the artifacts. A drawback of this approach is that the file must be regenerated when any of the items changes, but it creates a generic solution that can be run on any kind of simulator.

Kernel Image

I downloaded Linux 3.13.1 as the starting point. This was the latest kernel at the time I did the initial simulation, but new versions are released frequently.

Kernel Configuration

The default configuration for the Versatile Express is found in the Linux source tree at: arch/arm/configs/vexpress_defconfig

First, I use the Versatile Express configuration as the baseline by running:

$ make ARCH=arm vexpress_defconfig

File System

To get up and running quickly, I cheated - I copied the file new-buildroot-rootfs.cpio.gz from the Carbon ARM Cortex-A9 Linux CPAK and renamed it fs.cpio.gz. (A future article may cover the various ways to make file system images.)

Customizing Kernel Configuration

To create the single executable file with all of the needed artifacts, I needed to embed the file system image in the kernel, and append the Device Tree Blob at the end of the kernel image.

To embed the file system image in the kernel, you can use any of the Linux configuration interfaces. I tend to use menuconfig:

$ make ARCH=arm menuconfig

I navigated to the General Setup menu (see the image below), scrolled down to “Initramfs sources file(s)” and added the name of the file system image, fs.cpio.gz. I put this file at the top of the Linux source tree so no additional path is needed.

fs config

To append the Device Tree Blob at the end of the kernel image, access the Boot options menu item “Use appended device tree blob to zImage (EXPERIMENTAL).” I enabled this to append the .dtb file at the end of the zImage file (shown in the image below).

append dtb config

While I was in the Boot options menu I also set the Default kernel command string by adding root=/dev/ram and earlyprintk. This tells the kernel to use a RAM-based root file system, the only possible choice since no other storage is included in the hardware design. There are many ways to set the default kernel command string, but this approach works well for this application, in which we want to link everything into a single ELF file.
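The three menuconfig changes end up as ordinary entries in the kernel's .config file, which makes them easy to verify after the fact. The sketch below is one way to do that in Python. The Kconfig symbol names (CONFIG_INITRAMFS_SOURCE, CONFIG_ARM_APPENDED_DTB, CONFIG_CMDLINE) are the ones I would expect these menu items to set; the sample text is an illustrative excerpt, not a copy of the real file, and the helper names are made up.

```python
# Settings the text describes, expressed as .config entries.
REQUIRED_OPTIONS = {
    "CONFIG_INITRAMFS_SOURCE": '"fs.cpio.gz"',  # embed the file system image
    "CONFIG_ARM_APPENDED_DTB": "y",             # accept a DTB appended to zImage
}
REQUIRED_CMDLINE_WORDS = ("root=/dev/ram", "earlyprintk")

# Illustrative excerpt only; a real .config has thousands of lines.
SAMPLE_CONFIG = """\
# General setup
CONFIG_INITRAMFS_SOURCE="fs.cpio.gz"
# Boot options
CONFIG_ARM_APPENDED_DTB=y
CONFIG_CMDLINE="root=/dev/ram earlyprintk"
"""


def parse_config(text: str) -> dict:
    """Return {symbol: value} for every KEY=VALUE line, skipping comments."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")  # split on the first '=' only
        if sep:
            options[key] = value
    return options


def missing_settings(text: str) -> list:
    """List human-readable problems; an empty list means all settings are present."""
    options = parse_config(text)
    problems = []
    for key, wanted in REQUIRED_OPTIONS.items():
        if options.get(key) != wanted:
            problems.append(f"{key} should be {wanted}")
    cmdline_words = options.get("CONFIG_CMDLINE", "").strip('"').split()
    for word in REQUIRED_CMDLINE_WORDS:
        if word not in cmdline_words:
            problems.append(f"CONFIG_CMDLINE is missing {word}")
    return problems


if __name__ == "__main__":
    # For a real tree, read the file instead: open(".config").read()
    print(missing_settings(SAMPLE_CONFIG) or "configuration looks right")
```

Running a check like this before a long build is cheaper than discovering at boot time that the root file system was never embedded.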

Source Code Changes

The bad news is I wasn’t able to run the 3.13.1 kernel on the a15mini without any kernel source code changes. The good news is I came within 1 line of the goal. I needed to edit the file arch/arm/mach-vexpress/v2m.c to remove the line that configures the kernel scheduler clock to read a time value from the Versatile Express System Registers (this peripheral has a register which provides a time value).  To achieve my goal of including a minimum of hardware, I wanted to forgo the System Registers.

The line I removed is in the function v2m_dt_init_early() and is line number 423, the call to versatile_sched_clock_init().



Now the source tree could be compiled. I work on Ubuntu 13.10 and use the cross-compiler with the GNU prefix arm-linux-gnueabi, so my compile command was:

$ make ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- -j 4

The most important compilation result is the kernel image file arch/arm/boot/zImage.

Device Tree Blob

The Versatile Express device trees for ARM Fast Models are available in a git repository, so the easiest way to get the files is to clone it:

$ git clone git://

The file that is the closest match for the a15mini in the fast_models/ directory is rtsm_ve-cortex_a15x1.dts.

This file also includes a .dtsi file named rtsm_ve-motherboard.dtsi. The main work to support the a15mini hardware system was to modify these two device tree files so they match the hardware available by removing all of the hardware that doesn’t exist.

So first, I edited rtsm_ve-motherboard.dtsi to remove all of the peripherals that are gone from the a15mini, such as flash, Ethernet, the keyboard/mouse controller, three extra UARTs, the watchdog timer, and more. I kept the original structure, but shrank the file all the way down to just the UART!

Architected Timer Support

Linux needs a timer to run. I removed the SP804 timers that are present in the Versatile Express and the System Register block that provides the time value for the scheduler clock. Replacing these, I want to use an internal timer in the A15, sometimes called the Architected timer, to keep the hardware design as small as possible.

Support for the ARM Architected timer is already present in rtsm_ve-cortex_a15x1.dts, but I found that it didn’t work right away when I commented out the call to versatile_sched_clock_init(). The kernel crashed at the point of setting up the Architected timer, and printed a message that there was no frequency available. By looking through other device tree files, I found that some armv7-timer entries had a clock-frequency attached to them, while the Versatile Express entry did not. After adding the clock-frequency to the timer in rtsm_ve-cortex_a15x1.dts, the timer worked, and Linux runs as expected with the internal timer. No external timer is needed!

Here is the timer entry with the clock-frequency added:


I have attached the two device tree files so you can take a look at them as needed.
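As a rough illustration of that change, the Python sketch below adds a clock-frequency property to an armv7-timer node in device tree source text. The node shown is a simplified stand-in (a real entry also carries interrupt properties), and the 100 MHz value is a placeholder, not the value from the attached files; use the counter frequency of your own system. The function name is made up for this example.

```python
import re

# Matches the compatible line of an ARM architected timer node and
# remembers its indentation so the new property lines up with it.
TIMER_COMPATIBLE = re.compile(
    r'^(?P<indent>[ \t]*)compatible\s*=\s*"arm,armv7-timer"\s*;[ \t]*$',
    re.MULTILINE,
)

# Simplified stand-in for the timer node; not the contents of the real file.
SAMPLE_NODE = """\
timer {
    compatible = "arm,armv7-timer";
};
"""

PLACEHOLDER_HZ = 100_000_000  # placeholder value: an assumption, not from the article


def add_clock_frequency(dts_text: str, hz: int) -> str:
    """Insert 'clock-frequency = <hz>;' after the timer's compatible line.

    The text is returned unchanged if the node already has the property.
    """
    match = TIMER_COMPATIBLE.search(dts_text)
    if match is None:
        raise ValueError("no arm,armv7-timer node found")
    node_end = dts_text.find("};", match.end())
    if node_end == -1:
        node_end = len(dts_text)
    if "clock-frequency" in dts_text[match.end():node_end]:
        return dts_text
    new_line = f"\n{match.group('indent')}clock-frequency = <{hz}>;"
    return dts_text[:match.end()] + new_line + dts_text[match.end():]


if __name__ == "__main__":
    print(add_clock_frequency(SAMPLE_NODE, PLACEHOLDER_HZ))
```

For a one-off change, editing the .dts file by hand is simpler; a script like this only pays off when the same tweak has to be applied to several device tree variants.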

Using the Device Tree Compiler

After the two device tree files have been shrunk down to support only the hardware available on the a15mini, they can be compiled into the device tree blob using the device tree compiler, dtc, which is available in the scripts/dtc directory of the kernel source tree.

I did not add anything to my PATH to find dtc; I just reached over into the kernel source tree and ran:

$ ../linux-3.13/scripts/dtc/dtc -O dtb -o rtsm_ve-cortex_a15x1.dtb fast_models/rtsm_ve-cortex_a15x1.dts

Adding Device Tree to Kernel

To use the feature that appends the device tree to the Linux kernel image, I simply copied the zImage from the kernel tree.

$ cp linux-3.13.1/arch/arm/boot/zImage  .

Then I concatenated the device tree blob to the end of the kernel image:

$ cat arm-dts/rtsm_ve-cortex_a15x1.dtb >> zImage
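The cat command appends whatever it is given, so a wrong or truncated file goes unnoticed until the kernel fails to boot. A slightly more defensive version of the same step can be sketched in Python. It relies on the flattened device tree header layout, which starts with the big-endian magic number 0xd00dfeed followed by the total blob size; the function name and the commented file paths are illustrative only.

```python
import struct

FDT_MAGIC = 0xD00DFEED  # first word of every flattened device tree blob


def append_dtb(zimage: bytes, dtb: bytes) -> bytes:
    """Return zImage with the device tree blob appended, after basic checks."""
    if len(dtb) < 8:
        raise ValueError("blob is too short to hold a device tree header")
    # The header fields are stored big-endian: magic, then total size in bytes.
    magic, total_size = struct.unpack(">II", dtb[:8])
    if magic != FDT_MAGIC:
        raise ValueError(f"bad magic 0x{magic:08x}, not a device tree blob")
    if total_size > len(dtb):
        raise ValueError("blob is truncated: header claims more bytes than present")
    return zimage + dtb


if __name__ == "__main__":
    # Self-contained demonstration with a fake 16-byte blob.
    fake_dtb = struct.pack(">II", FDT_MAGIC, 16) + bytes(8)
    combined = append_dtb(b"fake-kernel", fake_dtb)
    print(len(combined), "bytes in combined image")
    # With real files the same call would look like:
    #   zimage = open("zImage", "rb").read()
    #   dtb = open("arm-dts/rtsm_ve-cortex_a15x1.dtb", "rb").read()
    #   open("zImage", "wb").write(append_dtb(zimage, dtb))
```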

Boot Loader

There are many ways to create a boot loader, but to meet the goal of a single, all-inclusive executable file, a small assembly boot loader is the best. A great example of this is available in the ARM Fast Models ThirdParty IP package. This is an add-on to ARM Fast Models which contains all of the open source software that can be run on Fast Models examples. The majority of the package is the Linux source trees and file system images for different examples.

I selected two really useful files from the RTSM_Linux source code that comes with the ThirdParty IP package:

  • boot.S - The assembly file that serves as the boot loader. It’s small and easy to understand, and much easier to work with in cases where a full boot loader like u-boot is not needed.
  • A linker script that specifies how to link boot.o (compiled boot.S) and zImage into a single executable file.

The addition of these two files, combined with the modified zImage (including embedded file system and concatenated device tree) means that everything is now present in the single executable file.

I also made some minor adjustments to the Makefile, inserting my paths, compiler name, and file names so it would generate the file a15-linux.axf as the final output that will be used in simulation.

The updated Makefile is shown below:

boot make

Running Simulation

Because the interconnect is more complex than a simple address decoder, there are two ways to confirm the design is correct.

  • Run Linux on a fast version of the design
  • Run a small software program in cycle accurate mode

I would recommend both methods to make sure the design is functioning properly and there are no errors in connecting the interrupt, setting CPU model parameters, or other common mistakes made during design construction. My most common mistake is forgetting to set the PERIPHBASE address of the A15 to 0x2c000000.

Since I’m focusing on Linux I will describe how to run a fast version of the design.

From the Tools menu in sdcanvas select FastModel System…

 tools fast model

The Fast Model System Creator dialog will appear as shown below. Clicking the Create button will generate a Fast Model equivalent system automatically from the cycle accurate system.

fm system

This Fast Model system can be run in sdsim using Simulation -> Simulate System… (or F5). When the simulator starts, specify the a15-linux.axf file that was created in the last section and watch Linux boot in a matter of seconds. The terminal below shows 3.13.1 Linux with the machine type reported as Versatile Express.

sdsim fm boot

Swap & Play

Now we have established a working cycle accurate simulation and a Fast Model simulation for the same design, both of which can be run with the SoC Designer simulator, sdsim.

Working with Linux on a cycle accurate simulation is a great way to study all of the details of the hardware/software combination, such as bus utilization, cache metrics, cache snooping, barrier transactions, and much more. It’s exciting until you realize that Linux takes about 300M instructions and well over a billion clock cycles to boot.
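To put those boot numbers in perspective, the short Python sketch below works through the arithmetic. The instruction and cycle counts come from the paragraph above, with one billion cycles treated as a lower bound; the simulator speeds are hypothetical round numbers chosen for illustration, not measurements of sdsim.

```python
BOOT_INSTRUCTIONS = 300_000_000   # "about 300M instructions"
BOOT_CYCLES = 1_000_000_000       # lower bound: "well over a billion"

# Hypothetical simulation speeds in simulated cycles per wall-clock second.
ASSUMED_SPEEDS = (10_000, 100_000, 1_000_000)


def cycles_per_instruction(cycles: int, instructions: int) -> float:
    """Average number of clock cycles spent per executed instruction."""
    return cycles / instructions


def wall_clock_seconds(cycles: int, sim_cycles_per_second: int) -> float:
    """Wall-clock time needed to simulate the given number of cycles."""
    return cycles / sim_cycles_per_second


if __name__ == "__main__":
    cpi = cycles_per_instruction(BOOT_CYCLES, BOOT_INSTRUCTIONS)
    print(f"at least {cpi:.1f} cycles per instruction during boot")
    for speed in ASSUMED_SPEEDS:
        hours = wall_clock_seconds(BOOT_CYCLES, speed) / 3600
        print(f"{speed:>9,} cycles/s -> at least {hours:.1f} hours to boot")
```

Whatever the actual simulator speed, the boot time scales linearly with the cycle count, which is why skipping the boot with a checkpoint is so attractive.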

One alternative to waiting for a full cycle accurate simulation is to create Swap & Play checkpoints at various points of interest and then load the checkpoints into the cycle accurate simulation. To create a checkpoint, run the Fast Model simulation and then stop the simulator at the place you want, such as the Linux prompt, using either a breakpoint or the Stop button. Use the File -> Save As menu in sdsim and select Swap & Play Checkpoint (*.mxc) as shown below:


The next dialog will ask for a name of the checkpoint and a location for the file. Enter any name and hit OK to save the checkpoint.

To load the checkpoint into the cycle accurate simulation, load the design into sdsim and then use the File -> Restore checkpoint view…

Select the checkpoint saved in the Fast Model simulation and it will load into the cycle accurate simulation. There is no need to even load an image file for this case since it will be restored by the checkpoint. The disassembly window and the register window will show the same location that was saved from the fast model simulation. I saved a checkpoint at the Linux prompt and restored it into the cycle accurate simulation and I can even see I’m sitting at the WFI instruction in the Linux idle loop. 


A debugger such as RealView or DS-5 can also be connected to start source level debugging.

An alternative trick I use is the addr2line utility to find out where in the code the Disassembly or Register Window is showing. This is useful to just take a peek at the current location without starting the full debugger. For the disassembly window above I do:

$ arm-linux-gnueabi-addr2line -e vmlinux 0x80014524
/home/cds/jasona/

Now I can see the source code for the current location as a quick check. Sure enough, I’m sitting at the Linux idle loop as expected since nothing is happening sitting in the shell at the prompt. Here is the code:


It’s easy to imagine using Swap & Play to load checkpoints and do cycle accurate debugging as well as performance analysis for benchmarks. Carbon users typically run benchmarks such as Dhrystone, CoreMark, and LMbench as Linux applications.

So far, this was done using a single core A15 CPU. It’s interesting but there are no actual systems that use a single core A15. Next time I will show how to extend the a15mini for multi-core simulation by moving to the ARM Cortex-A15x2 CPU and running SMP Linux, again with as little hardware as possible. The dual-core A15 matches one of my favorite machines, the Samsung Chromebook.

As I mentioned at the beginning, all of the work that I’ve done here porting Linux and setting up a system which can be used with Swap & Play is available as a CPAK on Carbon IP Exchange.  You can use this system to port your own version of Linux or customize the hardware configuration to match that of your own design.

Jason Andrews



Getting Up and Running with the ARM Mali-450 MP GPU (Part 3 of 3)

In the previous parts of this blog series, I discussed the procedure of Carbonizing the RTL of an ARM Mali-450 GPU and building an SoC Designer component, followed by the description of a virtual prototype environment that runs cycle-accurate integration tests provided by ARM on the Mali-450 GPU using a Cortex-A15 based system.

This part describes a Linux-based Carbon Performance Analysis Kit (CPAK) containing a Cortex-A15 based system with an ARM Mali-450 GPU. This CPAK enables the user to integrate the Mali GPU drivers into the Linux kernel and register the Mali driver while booting Linux on the virtual prototype environment.  We’ll do this by combining the speed of ARM’s Fast Models with the accuracy of the Carbonized GPU model in the same system.

Partitioning the System

The block diagram in Figure 1 shows the Mali-450 CPAK that we’ve been working with so far.  This system is composed entirely of 100% accurate models, which enable us to see the impact of our IP configuration settings and bare-metal software. As we migrate to higher level tasks such as driver development, we can map portions of the system to more abstract models in order to execute at faster speeds. SoC Designer automatically understands the relationship between Carbonized models of ARM IP and their Fast Model equivalents, so all we need to do is tell the Fast Model Generator tool in SoC Designer which models we would like to execute accurately, and which ones should be represented as Fast Models.

 ARM Mali-450 CPAK Block Diagram

The dotted line indicates the portion of the virtual prototype we’d like to execute as ARM Fast Models. The rest of the system will continue to execute as cycle accurate models. Once we’ve identified the models for each domain, SoC Designer will automatically build a new virtual prototype representation containing both Fast Models and Carbonized, 100% accurate models. Figure 2 shows a block diagram that describes such a system. All of the necessary transactor logic to map between the abstract models and the accurate models is inserted automatically.


Using the CPAK with ARM Fast Models

Now that we’ve updated the CPAK to ARM Fast Models, let’s put it to use. Since the virtual prototype will run at ARM Fast Model speeds when the accurate models are not being used, we can now use this platform to quickly boot Linux and integrate the Mali device driver.  In order to do this, I integrated the Mali Linux drivers into the Linux kernel that was used as part of the A15 Linux CPAK. The Linux drivers are part of the Mali-450 GPU Linux Driver Development Kit (DDK) provided by ARM. The DDK contains the Mali Linux device drivers and the base drivers. It also provides links to download the OpenGL drivers, specific to Mali-450 GPU. The Integration Guide was used, along with minor modifications specific to the virtual prototyping environment, to integrate the software into the Linux kernel. The integration procedure involved the following steps:

  1. Integrating and building the Mali Linux device driver: The user specifies the macros corresponding to the GPU configuration parameters, along with configuring the Mali GPU memory, the framebuffer memory and power management options. The device driver was built as part of the kernel image, but it can also be built as a kernel module.
  2. Building the OpenGL and GLES libraries and adding them to the root file system.
  3. Building the GPU benchmarks and adding them to the root file system.
  4. Building the kernel image that contains the Mali device drivers, base drivers, the OpenGL drivers and the GPU benchmarks.


This CPAK can prove useful to device driver developers, who can step through the process of loading the Mali device drivers during boot-up. Figure 3 shows the messages printed by the console while loading the Mali Linux drivers with debug messages turned on. It describes the whole procedure of registering the drivers, which starts with initializing the Mali memory system and ends with initializing the platform device, provided all intermediate steps pass. The intermediate steps define the settings for the framebuffer and for the dedicated and shared memory. Each of the internal components of the Mali-450, including the L2 cache, the Graphics Processor, the Pixel Processors and the Dynamic Load Balancing unit, is then created and its base address is defined.

Figure 3: Console messages printed while loading the Mali Linux drivers

The Results

With the driver ported to Linux, it is now possible to quickly boot the OS in just over two minutes and then start processing frames with 100% accuracy.  The benchmark that uses the Mali base drivers runs test suites on:

  1. GPU memory allocation
  2. Initialization of the Graphics Processor
  3. Initialization of the Pixel Processors
  4. Running vertex shader jobs
  5. Running rendering jobs that draw a simple triangle.

Getting the whole system up and running took me 3 weeks, but all of the work that I’ve done is incorporated into the Mali-450 CPAK, meaning that you can duplicate the same results within minutes of downloading the package. The package includes an app note that describes the steps the user needs to take to perform the integration process using pre-built binaries.



Getting the Most Out of Advanced ARM IP

Fall is always a busy time of year here in New England where Carbon is based. The students are back at the colleges and universities, leaf peepers clog up all the roads and for a brief period all four major pro sports teams are playing (Go Red Sox!).  Fall is also Carbon's busy time for conferences.  Although we don't typically exhibit at many tradeshows, we do try to have a presence at most of ARM's technical conferences.  This flurry of activity starts with ARM® TechCon™ in Santa Clara and typically ends six weeks later with the European Technical Symposium in Paris.  Although it makes for a large amount of travel, it's a great opportunity to speak with lots of ARM designers and programmers about the challenges they're facing.

Carbon will be present at most of these conferences (the full list is at the end of the blog) and will be making presentations at most of them as well.  At ARM TechCon, we'll be doing a joint presentation with ARM entitled "Getting the Most Out of the ARM CoreLink™ NIC-400."  This presentation will go over some of the highlights of ARM's NIC-400 product and then discuss a two-step methodology to optimize this crucial piece of IP to best attain your design goals.  After last year's ARM TechCon joint presentation with ARM, "High Performance or Cycle Accuracy? You Can Have Both!" we made the corresponding whitepaper and presentation available for download and we will do that once again this year after the conference.

At the other conferences, we'll be presenting "Getting the Most Out of Advanced ARM IP" which will discuss methodologies to optimize some of ARM's other recently announced IP blocks such as the ARM Cortex™-A57, Cortex-A53, Cortex-A15, Mali and several others.  While I don't want to give away the entire presentation here, you can be sure it will probably make some mention of implementation accuracy, performance optimization and executing pre-silicon firmware.

If you're attending any of these conferences, please come by the Carbon booth and say hi.  We'd love the chance to talk with you about the latest work we've been doing to enable customers to optimize their ARM-based SoC designs and get the most out of advanced ARM IP.

Carbon will be participating in the following conferences:

Date | Event | Location
October 30-31 | ARM TechCon | Santa Clara Convention Center, Santa Clara, CA
November 19 | ARM Technology Symposium | Intercontinental Seoul Coex, Seoul, Korea
November 25 | ARM Technology Symposium | Sheraton Hongqiao, Shanghai, China
November 27 | ARM Technology Symposium | Sheraton Dongcheng, Beijing, China
November 29 | ARM Technology Symposium | Ritz-Carlton Futian, Shenzhen, China
December 6 | ARM Technology Symposium | Tokyo Conference Center, Tokyo, Japan
December 12 | ARM Technology Symposium | CAP 15, Paris, France


Standing Out in the Crowd, Differentiating Your ARM-based SoC

As system on chip (SoC) designs have grown from 10s to 100s of millions of gates, designers have had to go to great lengths to deliver designs which are well differentiated from the competition.  Whereas the majority of the content of previous generation SoCs may have been designed internally or created from scratch for a new generation, this is certainly not the case nowadays, as the vast majority of intellectual property (IP) blocks are reused from previous designs or, more likely, purchased from external sources.  You only need to look at the fantastic market success enjoyed by IP companies such as ARM and Arteris to see how much the industry now relies upon third party IP to drive its systems.  (The recent spate of purchases by Cadence® shows how much importance they see in this market segment.) As the third party content of the chip rises, however, it becomes increasingly difficult to differentiate your SoC design from others in the marketplace.  If everyone is designing with an ARM® Cortex™-A15 processor, Arteris® FlexNoC™ interconnect, Cadence memory controller and Imagination Technologies® GPU, and you’re using the same IP, how can you differentiate your system design, especially if you’re all using the same fab?

As the leading provider of 100% accurate virtual IP models and systems (including models from ARM, Arteris, Cadence, Imagination Technologies and many more), Carbon has seen this exact problem play out time and time again throughout our customer base.  Differentiation is achieved in many ways of course, but the most advanced approaches typically focus around a few key design areas (and I’ll confine my discussion to the front end of the design cycle here since it’s where you can make the most impact): configuration, integration, and power instrumentation. 


IP configuration seems intuitively obvious, but the complex interactions between all of the various options in any single IP block can lead to a huge difference in the achievable performance.  Tie that block in with a number of other, similarly configurable blocks and the possible options grow exponentially.  Leading edge designers are taking a two-stage approach towards IP configuration: stand-alone and integrated.  Stand-alone optimization is just what it sounds like.  The IP block is placed in a very simple test system where the IP block is modeled along with traffic generators and receivers on all of the various relevant ports.  This is intuitive of course because it’s how many blocks are verified.  Instead of focusing on verification however, the emphasis is on the interplay of various configuration options. The IP options are exhaustively modeled and then subjected to representative system traffic.  This approach can quickly point out which options will best meet your target specifications. It seems intuitively obvious but only a small number of IP teams were employing this approach for non-interconnect IP when we first started working with them.  A recent design was able to double the performance on memory reads by applying these methods to a memory controller configuration that was already shipping in an earlier generation product.   
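The exponential growth mentioned above is easy to underestimate, so here is a minimal Python sketch of what an exhaustive stand-alone sweep has to enumerate. The option names and values are entirely hypothetical (they do not describe any particular memory controller), and the evaluation step, where each configuration would be simulated against representative traffic, is deliberately left out.

```python
from itertools import product
from math import prod

# Hypothetical configuration options for a single IP block.
OPTIONS = {
    "queue_depth": (4, 8, 16, 32),
    "page_policy": ("open", "closed"),
    "arbitration": ("round_robin", "priority", "age_based"),
}


def configuration_count(options: dict) -> int:
    """Number of distinct configurations of one block."""
    return prod(len(values) for values in options.values())


def all_configurations(options: dict):
    """Yield every combination of option values as a dict."""
    names = list(options)
    for values in product(*(options[name] for name in names)):
        yield dict(zip(names, values))


def system_configuration_count(per_block_counts) -> int:
    """Combined count when several configurable blocks are tied together."""
    return prod(per_block_counts)


if __name__ == "__main__":
    single = configuration_count(OPTIONS)
    print("one block:", single, "configurations")
    print("four such blocks together:", system_configuration_count([single] * 4))
    for config in all_configurations(OPTIONS):
        pass  # a real sweep would simulate this configuration here
```

Even this toy block, with three options, is cheap to sweep on its own but not once several such blocks are combined, which is the argument for optimizing stand-alone first and then validating in the integrated system.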


In the interconnect world, the story is a bit different as it’s quite common to construct a virtual platform containing traffic generators combined with real stimulus in order to try out various configuration settings.  Carbon has partnered with Arteris to make pre-built virtual prototypes (called Carbon Performance Analysis Kits™ or CPAKs™) available for exactly this purpose.  These CPAKs are available containing various ARM processors (including the ARM Cortex-A15, Cortex-A7 and Cortex-A9 with more to come) as well as multiple traffic generators all tied together either with generic interconnect components or IP from ARM or Arteris.  Bare-metal and OS software is bundled as well to configure and drive the system.  It’s an ideal starting point for quick optimization or customization to more closely reflect your own design.

CPAK featuring an ARM Cortex-A15 and Arteris FlexNoC

Screenshot of a CPAK featuring a dual-core ARM Cortex-A9 together with an Arteris FlexNoC and multiple configurable traffic generators


Achieving the best system performance is not possible on a per-component basis; it must be looked at within the context of a system.  Traffic generators can give a first approximation of real traffic, but the only true way to validate the performance characteristics of a system is to pull it together and have it execute real software.  Only then can the true impact of various configuration settings be seen and problems uncovered.  The interaction between high priority arbiters and high bandwidth components seems to be an especially problematic one, as the performance is tweaked to make sure that components aren’t starved and power isn’t wasted (overdesign is just as bad a problem as under-design when you’re on the bleeding edge of mobile devices).

The impact of software here cannot be overstated as it is ultimately the system software that determines how well the overall system will function and how much power it will burn.  Ensuring that the software can correctly interact with the hardware to achieve the desired performance is a key milestone and leading edge designers subject our virtual prototypes to billions of bare-metal and OS-driven cycles running benchmarks of various flavors to not only stimulate the systems in various software-driven ways but also validate that the product being built can live up to the marketing claims they’ll be making.  Have you noticed that certain companies seem to always produce the fastest phone chips?  They’re the ones that don’t look at speed optimization as a hardware problem or a software problem but rather an integrated system problem.

Here's a whitepaper published a few years ago by Samsung discussing how they used a cycle accurate virtual prototype to optimize the performance of their software even after the silicon had already been fabricated.  Ideally you do this step earlier in the design so you can impact the hardware decisions as well, but the results which Samsung obtained are still impressive.


Software Driven Power Optimization

Hold it.  Power?  Didn’t we promise to focus on front-end issues?  Although traditionally relegated to the back end of the design cycle, power decisions are being moved forward in the design cycle and many leading edge SoCs today have a concerted effort to measure the consumption of their devices while it’s still straightforward to make design decisions to reduce the power consumed by the system.  This is typically done by executing system software on an accurate virtual prototype and then tracking the various power metrics.  This can be implemented using a straightforward approach such as dumping waveforms while executing system software and then analyzing these results in your favorite EDA power tool.  Software-based power vectors give a much better indication of how the device will actually perform and enable much more meaningful power decisions than are possible with vectors derived from an RTL testbench. 

More sophisticated customers have adopted an instrumentation flow to enable power analysis. Instead of dumping waveforms during execution and then running these through a power analysis tool, they do a preliminary step of creating power numbers which correspond to the various power states of each model and then instrument the model using callbacks to dynamically track these states throughout the system.  This enables runtime power analysis to be done without requiring waveforms and third party power tools.  It also means that the system itself runs substantially faster which is always a big benefit.  We’ve published a whitepaper which details the implementation steps required for both approaches.
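To make the instrumentation idea a little more concrete, here is a minimal sketch of how per-state power numbers and state-change callbacks can be combined into a running energy total. It is only an illustration: the `PowerTracker` class, the component and state names, and the milliwatt values are all assumptions made up for this example, not part of SoC Designer or any power tool.

```python
class PowerTracker:
    """Accumulates energy per component from power-state change callbacks."""

    def __init__(self, power_mw):
        # Hypothetical power numbers in milliwatts, keyed by (component, state).
        self.power_mw = power_mw
        self.energy_nj = {}   # component -> accumulated energy in nanojoules
        self._current = {}    # component -> (state, time the state was entered, in us)

    def on_state_change(self, component, new_state, time_us):
        """Callback a model would invoke whenever it enters a new power state."""
        self._accumulate(component, time_us)
        self._current[component] = (new_state, time_us)

    def _accumulate(self, component, time_us):
        if component not in self._current:
            return
        state, entered_us = self._current[component]
        # milliwatts * microseconds = nanojoules
        energy = self.power_mw[(component, state)] * (time_us - entered_us)
        self.energy_nj[component] = self.energy_nj.get(component, 0.0) + energy

    def finish(self, time_us):
        """Close every open interval at the end of the run and return the totals."""
        for component in list(self._current):
            self._accumulate(component, time_us)
            state, _ = self._current[component]
            self._current[component] = (state, time_us)
        return dict(self.energy_nj)


# Made-up power numbers and a made-up sequence of state changes.
POWER_MW = {
    ("cpu", "active"): 500.0,
    ("cpu", "idle"): 50.0,
    ("gpu", "active"): 800.0,
    ("gpu", "off"): 0.0,
}

tracker = PowerTracker(POWER_MW)
tracker.on_state_change("cpu", "active", 0)
tracker.on_state_change("gpu", "off", 0)
tracker.on_state_change("cpu", "idle", 100)
tracker.on_state_change("gpu", "active", 100)
totals = tracker.finish(300)
print(totals)
```

Because the tracker only does a multiply and an add on each state change, this style of analysis needs no waveform dump, which is where the speed advantage described above comes from.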


Evolving design techniques and an increased reliance upon third party IP have eliminated many of the approaches that designers have used to differentiate their products and distance themselves from the competition.  When one door closes however, another opens and the design areas discussed here are being used today by leading-edge design teams to attack this space and create cost-effective, differentiated products.  Adopting these approaches will enable your designs to keep pace and stand out in the crowd.


Note, this post was adapted from an article we included in the most recent newsletter from Arteris.

System Level Performance Optimization

Much has been written about the ways to accelerate an SoC design schedule.  If you added up all the marketing claims made by the various EDA companies on time to market savings, you’d end up being able to ship your advanced SoC months before you even conceive of the idea.  We’ve been working on a series of blogs lately focused around SoC design issues and questions that were laid out in the Asking the Tough Questions blog a few weeks ago.  So far we’ve talked about choosing IP, configuring it correctly and optimizing your memory subsystem.  Today we’re introducing software into the equation.  Instead of talking about how this can pull in your overall design schedule though (since that gets written about all the time) I’d like to focus on how software integration can be a true product differentiator.  Driving your SoC under development with system level software doesn’t just get things done more quickly. If done properly, you can use your software to drive the validation and optimization of the SoC being designed and create a more differentiated system-level offering.

Assembling the System

The first step in the process is to assemble the system that you’ll be optimizing.  This can be a stumbling block since, very often, the people optimizing the system don’t have the intimate design and modeling knowledge to prototype it.  This leads to a few possibilities: bite the bullet and learn about all the IP and steps involved to tie it together and configure it; don’t worry about the lower level details of the system and just abstract them away; or take an existing system model from someone else and customize it to meet your needs.  Since we’ll be talking about optimizing a system early in the design process, we’ll confine the discussion to virtual prototypes since physical prototypes and emulators are available too late in the design cycle to enable system level optimizations.

There are pluses to all of these approaches of course.  Learning about all of the IP being used in the system and then assembling a virtual prototype has a definite allure.  After all, it’s good to know about the blocks in your system and everyone likes learning new things.  Of course, there’s a downside to this as well.  By the time you know enough about the complexity of each block in order to tie a system model together (and possibly needing to create virtual models for some IP), the design window is likely to have passed. 

Abstracting away all the lower level implementation details has a great deal of appeal and it’s how a lot of virtual models are created.  This approach is great for high level software development since a properly created model will be functionally equivalent to the actual IP but just without the timing details.  This enables the model to run fast and also be developed quickly without knowing a lot of the lower level details about the IP block.  The downside of course is that these lower level timing details are really needed if we want to optimize the system level performance.  You can try to inject some level of timing detail to mimic the performance of an actual IP block but this quickly leads to approximations built upon approximations built upon a model which never attempted to model timing accuracy in the first place.  If you really need to see the performance impact of code running on a multi-issue, out-of-order, dual pipelined, multicore processor, are you going to trust timing approximations which you insert into your high level model which doesn’t even model the pipeline, let alone all of the speculatively executed code needed to fill those pipelines?  In the end, a more accurate model is needed if we’re truly going to optimize system level performance.

If hardware prototypes are available too late, creating the virtual models yourself takes too long and using high level virtual prototypes leads to incorrect answers, how can you quickly create a prototype which is accurate enough for system level optimization? 

Speed and Accuracy

This, of course, is where Carbon enters the picture.  SoCDesigner Plus enables you to combine 100% accurate models from leading IP vendors such as ARM, Imagination Technologies, Cadence, Arteris and others which are available from Carbon’s IP Exchange web portal.  These can be combined together with SystemC models, user RTL models compiled using Carbon Model Studio and also ARM’s Fast Models. Support for ARM Fast Models isn’t unique to Carbon of course; every vendor offers this.  Only Carbon however offers the capability to mix these fast, functional models with their 100% accurate equivalents and boot Linux or Android in seconds and then debug software or profile performance with 100% accuracy.  We’ve talked about this before a few times (here and here, for example), so instead of talking about the feature itself, let’s dig a bit deeper and talk about how this feature can be used to differentiate your SoC design.

A15 Linux CPAK Profiling Screenshot

Figure 1: Profiling the Linux boot process on a Cortex-A15 CPAK

A System Approach

The key to optimizing system performance is to address it as a system problem, not a hardware or software issue.  Teams spend huge amounts of effort optimizing the performance of the hardware subsystem and an equally large amount of time enhancing the speed of the software but it’s far too rare that these two tasks are performed concurrently.  This means that there are missed opportunities for additional levels of optimization. 

Having a virtual prototype which is both fast and 100% accurate enables this system level optimization to take place.  The typical first step is to port one of the industry standard benchmarks to the platform.  This is always a valuable step since the benchmark will be run on the finished silicon to provide marketing positioning.  Why not execute the same benchmark during the development process to identify areas for optimization? 


The initial port of software benchmark code can easily be done on a high level representation of the system.  In our case, this means running with ARM Fast Model representations of the processor(s) and related IP instead of their Carbonized equivalents.  This fast platform is great for functional code porting but doesn’t reveal much about system performance.  Therefore, once the code is ported and up and running, it’s time to start running accurately to really start getting the value from running the benchmark.  Swap & Play is great for this task since you can run quickly up to the point of interest, just before the benchmark starts for example, in the Fast Model representation and then switch over to the 100% accurate representation of the system to get correct results.  You can then use these results to optimize the interaction of the various components of the system.  SoCDesigner’s performance analysis visualization tools enable you to quickly identify performance bottlenecks and how they correspond to other activity in the hardware or software.

System software can be similarly optimized.  When executing software on the accurate representation of the system it is straightforward to track the performance characteristics of both the hardware and software using SoCDesigner’s visualization tools.  Now it is quite simple to draw correlations between software routines and their impact on overall system performance.  This is obviously of great importance as system complexity increases.  We’ve blogged previously about how Samsung used this approach with great success to optimize the software performance of a hybrid disk controller after silicon was already delivered.  The whitepaper discussing this is available as well.  This same approach is even more valuable when the hardware is still in a state where it can also be optimized.  After all, system optimizations are typically most effective when they’re based upon both hardware and software features.


The approaches above are just scratching the surface of the optimizations that are possible when running real software on real hardware in advance of silicon.  We didn’t even touch on the verification value which can also be obtained, and there are numerous additional optimizations which can be uncovered as well.  We’ll leave those for a future blog.


Interconnect Optimization

Bill’s last blog summed up the tough questions SoC architects face at various phases of the design process and Andy’s last blog provided a great description of how Carbon customers tackle the many questions that arise during the IP selection process.  In this blog, I will move on to the interconnect and the questions involved with optimization to allow for a balanced SoC:  For performance critical paths, how can the bandwidth be maximized while minimizing the average latency?   How can less performance sensitive paths be managed without disrupting higher priority traffic?  How can the fabric strike the appropriate balance between throughput and latency for all paths through the bus?  What is the best way to isolate and eliminate performance bottlenecks?  How will cache coherency impact the interconnect traffic – and system throughput? 

Since this topic is a lot deeper than can be explored in a blog posting, you can get a much more in depth analysis by downloading our whitepaper on interconnect optimization.

Chicken or Egg?

The SoC architect faces a unique challenge in selecting and optimizing the system bus.  Usually the CPU core(s) and possibly the GPU are selected prior to the interconnect, but other IP blocks including the main memory controller, DMAs and general purpose peripherals are yet to be selected.  Thus the interconnect is to be designed and optimized with an incomplete understanding of the actual workloads that it will need to arbitrate and load balance, as well as uncertainty about the latency and throughput constraints that the actual slave devices will impose.  It is no surprise that the temptation to overdesign a low latency mesh is so strong, as Bill noted.

The paradox is that the key design decisions in optimizing the fabric require an understanding of the final system.  Consider the performance sensitive path between the CPU and main memory.  The interconnect must harmonize with the main memory controller configuration and programming such that the memory bandwidth is maximized while the latency through the bus is minimized.  This design tradeoff is fundamental and must be accurately profiled, characterized, and displayed for the SoC architect.

SoC Designer Latency & Throughput

The relationship between latency & throughput visualized in SoC Designer
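As a rough feel for why bandwidth and latency pull against each other, here is a small sketch using a textbook M/M/1 queueing model. To be clear, this is a swapped-in approximation chosen for illustration, not how SoC Designer computes or displays anything, and the service rate used below is a made-up number. In this model the mean time a transaction spends in the system is 1 / (service rate − arrival rate), so latency climbs steeply as the offered load approaches what the memory path can sustain.

```python
def mm1_latency(service_rate, arrival_rate):
    """Mean time in system for an M/M/1 queue.

    Rates are in transactions per microsecond, so the result is in microseconds.
    """
    if arrival_rate < 0 or service_rate <= 0:
        raise ValueError("rates must be positive")
    if arrival_rate >= service_rate:
        raise ValueError("offered load must stay below the service rate")
    return 1.0 / (service_rate - arrival_rate)


# Hypothetical memory path that can serve 1 transaction per microsecond.
SERVICE_RATE = 1.0

for utilization in (0.1, 0.5, 0.8, 0.9, 0.95):
    latency = mm1_latency(SERVICE_RATE, utilization * SERVICE_RATE)
    print(f"utilization {utilization:4.0%}  mean latency {latency:6.2f} us")
```

Real interconnect traffic is burstier and more structured than this model assumes, which is exactly why the tradeoff needs to be profiled on an accurate model rather than estimated.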

A comprehensive I/O prioritization scheme must be established such that the fabric can provide flow control and arbitration to deliver appropriate QoS for the application.  Moreover, real world dynamics such as transient shifts in traffic and head-of-line blocking can completely disrupt the designed flow control of the interconnect.  Download the whitepaper for a more in depth look at the impact of slave device latency variations on traffic prioritization.

Carbon customers have found effective ways of resolving this “chicken or egg” design dilemma of configuring interconnect to meet system performance objectives while providing robust QoS early in the design process, when the exact I/O workloads and target constraints are unknown. 

A Two-Phased Approach:

The first phase involves configuring and building a virtual prototype to quickly and easily isolate performance bottlenecks.

   Phase 1 – Architectural Exploration

  • 100% cycle accurate model of interconnect available from Carbon IP Exchange
  • Carbon AXI traffic generators
  • Flexible memory sub-system models

   Allows fast cursory optimization:

  • Broad traffic profiling and sensitivity analysis
  • Transaction tracing and back-pressure identification
  • “What if?” analysis

Arteris Flex NoC Architectural Optimization 

Architectural exploration platform with Arteris FlexNoC 

As IP blocks are selected later in the design process, the virtual prototype can be reused and modified to incorporate 100% implementation accurate models in place of traffic generators:

   Phase 2 – Real World Virtual Prototype

  • 100% cycle accurate models of CPU & other IP
  • Real world application & traffic
  • HW & Software Interaction

   Allows architectural validation:

  • Accurate traffic profiling
  • Incremental optimization as more IP blocks are selected or designed
  • Cache coherency analysis
Arteris FlexNoC ARM Cortex A9 

Arteris FlexNoC based Real World Virtual prototype

Design tradeoffs explored earlier in Phase 1 can be revisited and validated against actual multicore CPU traffic.  The platform can be further modified as DMA IP blocks and memory controllers are selected.

Read more about this iterative two-phased approach to interconnect optimization in our white paper.

Coherency & Accuracy

Hardware-based cache coherency in SoCs has introduced significant complexity to the interconnect optimization process.  Artificial workloads from custom traffic generators are not well suited to replicating coherency operations alongside application workloads, since they lack fidelity to the coherent traffic produced by actual IP.  Virtual prototypes, such as the dual ARM® Cortex™-A15 with coherent ARM CCI-400 interconnect, allow for 100% implementation accurate simulation of coherent multiprocessor CPU traffic:

ARM CCI-400, Architectural Optimization, Cache Coherency

Carbon A15 bare metal CPAK multi-processor reference platform

The system above, from the Cortex-A15 Carbon Performance Analysis Kit (CPAK), allows for complete visibility of ACE traffic workloads in a SoC Designer environment.  The hardware coherency handshaking events in the CCI-400 can be profiled and visualized alongside the bus traffic to get a deeper understanding of the correlation between hardware events and ACE traffic.

ARM CCI-400, Cache Coherency, Architectural Optimization

CCI-400 Coherency event profiling in relation to A15 ACE interface transactions

Understanding the impact of coherency operations within the interconnect and to the overall system performance can help architects partition their designs across multiple networks on chip. 

Download the white paper for an in-depth look into how cache coherency affects bus traffic and how the SoC Designer environment facilitates the analysis of this complex interaction.  


Asking the Tough Questions

It always seems that there are more questions than answers when embarking upon any new endeavor.  System on chip (SoC) design is no exception to this rule.  There are many approaches that you can use in an attempt to answer these questions.  How much can you trust the answers that you get and how much of your future do you want to gamble on those answers?  The best way to answer that may be by asking a lot more questions…

Assuming that you’re designing an SoC, a typical design process starts with IP selection.  There is a lot of IP available from different vendors.  How do you choose the IP that best meets your design needs from a price, performance and area perspective?  Do you need an ARM® Cortex™-A15 or can you meet your design goals with a Cortex-A9 instead?  Should you reuse the hardware codec from the previous design or replicate that functionality in software for this revision?  It’s tempting of course to just choose the latest, fastest IP available but this approach has all sorts of potential issues.  You may end up with a very overdesigned SoC.  This design may blow away its performance targets but at the expense of much higher IP costs and potentially much higher power consumption.


Once you’ve chosen your IP blocks, either from external or internal sources, they need to be configured to play together efficiently to meet your design goals.  Once again, it may be tempting to use a fully connected mesh to hook together all of your IP with wide, low-latency busses but you’ll probably spend more power than you need and have unnecessarily fast speeds on some data transfers.  How will cache coherency cycles figure into your bus utilization?  The latest coherency extensions to the AMBA bus protocols are dramatically changing the shape of system bus traffic.  How much will this impact overall system throughput?  If the design is partitioned into multiple fabrics (and most advanced designs do indeed contain multiple fabrics or networks on chip) what will be the design impact of placing an IP block in one domain versus another?

The interconnect is of course only a part of the system performance equation.  How does your memory controller or controllers factor into the system performance equation?  Do you have too much latency on your processor to memory datapath?  Have you overdesigned the system in such a way that the data comes back quickly but you’re burning too much power?  What is the impact of this on your backend layout?  Does your interpretation of the arbitration priority of your memory controller match what will really happen on the bus when all of the components are tied together?


You shouldn’t underestimate the impact of cache sizing and layout on your system performance either, especially with the new AMBA Coherency extensions.  A few cache sizing and configuration decisions can have a dramatic impact on system performance and resultant snoop traffic.  How will your cache size impact system performance?  How much will that impact change when you start running different software?

Choosing and configuring your IP is only the start of your system design problem.  A system is far more than just hardware and there needs to be a way to get all the layers of the software stack up and running on the system as well, preferably well before silicon is taped out.  This is simple enough for high level application software but what about your firmware, drivers, diagnostics and other software that actually depends upon the hardware functionality of your new system?  How will you validate the impact of software on system level performance and power consumption?

There are typically substantially more engineers writing this software than there were hardware engineers to create and verify the hardware.  How will they all be enabled to develop and debug their portion of the system design?  When they uncover hardware problems (or at least problems that they blame on the hardware!) how will these problems be debugged?  Will the problem be thrown over the wall several times as fingers are pointed or will there be a common debug mechanism to enable the hardware and software engineer to work together?  Will that solution be affordable enough for everyone to use it or will it be a scarce, productivity-limiting resource with a signup sheet filled 24 hours a day? 

Finally, what happens if you want to enable your end customers to be productive on your system design before silicon is produced?  After all, they may need to write software for the system too and if they have to wait until after silicon to accurately develop this software you’ll need to wait longer to start making money on your design.  If this whole process takes too long, the design may even miss its market window entirely.


These questions are not unique, nor are they new.  They’ve been growing in importance however.  As more and more chip designs migrate from custom ASICs to SoCs, the system-level design problems which used to be faced by just a few teams are now faced by almost everyone.  There are a variety of solutions in the market to address these problems ranging from RTL simulation to high level virtual prototypes to all manner of hardware prototypes.  Over the course of the next few weeks we’ll look into each of these areas in more detail, discuss the questions that our customers are asking and show how they’re getting answers that they can trust.


Getting Up and Running with the ARM Mali-450 MP GPU (Part 2 of 3)

In Part 1 of the blog, I discussed the procedure of building a Carbon model and an SoCDesigner Plus component from the RTL of an ARM® Mali™-450 GPU that recently became available in the market. In this blog, I will discuss a bare-metal system I built on SoC Designer Plus that consists of an ARM Cortex™-A15 processor, the interconnect subsystem, memory, a Mali-450 GPU model, an external interrupt controller and a UART display as shown in the block diagram (the block diagram corresponds to the one that was available with the Integration Manual provided along with the RTL). I will also discuss how easily I could use the test harness provided by ARM as a part of the Platform Integration Kit along with the RTL to run simulations that test the Mali-450 GPU SoCDesigner Plus model.

The ARM Mali-450 GPU Bare-Metal System:

ARM Mali 450 System Block Diagram

The system is built such that the Mali-450 GPU has two AXI master ports that are used to access shared memory through the interconnect. The Cortex-A15 processor programs the GPU by configuring the internal GPU registers through the interconnect subsystem and the APB3 interface. The GIC-400 is configured to generate interrupts corresponding to each processor internal to the Mali-450 GPU. The UART is used to display characters and in this case, it displays contents of internal GPU registers and test results.

The Mali-450 GPU model was configured to have four Pixel Processors (PPs), two L2 Caches, one Geometry Processor (GP), and no Power Management Unit (PMU). As discussed in my previous blog, this configuration was provided to a customer who had initially requested an SoC Designer Plus model as soon as the RTL was made available by ARM.

The Test Infrastructure:

Along with the RTL, ARM also provided an Integration Kit that consists of tests to confirm that the GPU is correctly integrated into the system. These tests confirm that the internal APB registers of various components in the GPU (like the PPs, the GP, the L2 caches, the DMA unit, the Broadcast unit, and the Dynamic Load Balancing Unit) are readable and writable. The results of the tests are displayed on the UART. Tests that confirm that the IRQ pins are connected correctly and that the GIC-400 is properly configured are also available as a part of the test infrastructure. The boot code corresponding to the ARM processor and the driver code corresponding to the Mali-450 GPU, the GIC-400 and the UART were provided with the Integration Kit.
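For readers who haven't seen this kind of integration test before, the idea behind a register read/write check can be sketched in a few lines. This is not ARM's test code: the `RegisterBank` class is a hypothetical stand-in for a memory-mapped APB register block, and the offsets, masks and test patterns are invented for illustration.

```python
class RegisterBank:
    """Hypothetical stand-in for a memory-mapped block of 32-bit registers."""

    def __init__(self, writable_masks):
        # offset -> bit mask of the bits that software is allowed to write
        self._masks = dict(writable_masks)
        self._values = {offset: 0 for offset in self._masks}

    def read(self, offset):
        return self._values[offset]

    def write(self, offset, value):
        mask = self._masks[offset]
        kept = self._values[offset] & ~mask & 0xFFFFFFFF
        self._values[offset] = kept | (value & mask)


def check_register(bank, offset, expected_mask,
                   patterns=(0x55555555, 0xAAAAAAAA)):
    """Write alternating bit patterns and confirm they read back under the mask."""
    original = bank.read(offset)
    ok = True
    for pattern in patterns:
        bank.write(offset, pattern)
        if bank.read(offset) & expected_mask != pattern & expected_mask:
            ok = False
    bank.write(offset, original)  # leave the register as we found it
    return ok


# Made-up register map: one fully writable register, one with 16 writable bits.
bank = RegisterBank({0x00: 0xFFFFFFFF, 0x04: 0x0000FFFF})
for offset, mask in ((0x00, 0xFFFFFFFF), (0x04, 0x0000FFFF)):
    status = "PASS" if check_register(bank, offset, mask) else "FAIL"
    print(f"register 0x{offset:02X}: {status}")
```

In the real system the reads and writes travel through the interconnect and the APB3 interface, so a failing check usually points at an address map or wiring problem rather than at the GPU itself.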


system socd

I started building the system on SoCDesigner Plus by configuring the PL301 interconnect using ARM AMBA Designer with address ranges defined for the UART, GIC-400 Interrupt Controller, memory, and the GPU (these details were provided in the Integration Manual). Since the GPU required an address range of 192 KB and each APB3 slot on the PL301 provides only up to 4 KB, 48 APB3 slots were defined and eventually merged using the APBMerger component.
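The slot count is simple arithmetic: 192 KB divided into 4 KB slots gives 48 slots. The short sketch below works that out and lists the address window each slot would cover. The base address used here is a placeholder I picked for the example; the real value comes from the Integration Manual and is not reproduced in this post.

```python
KB = 1024

GPU_RANGE_BYTES = 192 * KB   # address range required by the GPU
APB_SLOT_BYTES = 4 * KB      # maximum range of a single APB3 slot
GPU_BASE = 0x4000_0000       # hypothetical base address, for illustration only


def apb_slots(base, total_bytes, slot_bytes):
    """Split an address range into equally sized (start, end) slot windows."""
    if total_bytes % slot_bytes != 0:
        raise ValueError("range is not a whole number of slots")
    count = total_bytes // slot_bytes
    return [(base + i * slot_bytes, base + (i + 1) * slot_bytes - 1)
            for i in range(count)]


slots = apb_slots(GPU_BASE, GPU_RANGE_BYTES, APB_SLOT_BYTES)
print(f"{len(slots)} slots")
print(f"first: 0x{slots[0][0]:08X}-0x{slots[0][1]:08X}")
print(f"last:  0x{slots[-1][0]:08X}-0x{slots[-1][1]:08X}")
```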

After having configured the PL301 interconnect, I only needed to put the whole system together on the SoCDesigner Plus Canvas and run the test application. By just providing the test name in the Makefile, the application file (axf file) to be loaded into the SoC Designer Plus simulator can be built. These axf files were loaded and the test applications executed immediately without any manual intervention. I had heard from colleagues that example code provided by ARM is easily ported using SoCDesigner Plus but it definitely was a pleasant surprise to me considering the minimal effort I had to make to get the simulations to run correctly. I did have to make a few modifications to the driver code corresponding to the UART and GIC-400 since the Integration Manual referred to different versions of those peripherals, and the integration tests were devised for a different GPU configuration. Thanks to the excellent debugging and monitoring capabilities provided by SoC Designer Plus, it took me minimal effort to make those changes.

As easy as all of this was for me to do, our customers won't have to repeat the steps since my work is available for their use.  The bare-metal system I put together on SoC Designer Plus is now available as a Carbon Performance Analysis Kit (CPAK) for customers. The integration tests described above are also bundled in and available as a part of the CPAK.

In Part 3 of this blog, I will describe the procedure of patching the Mali GPU Linux driver on the Linux CPAK, and how simulations can be run on SoC Designer Plus where the GPU processes certain images and displays them on an LCD.

ARM Cortex-A15 Systems and Advanced Performance Optimization

For those that have been following Carbon's Blog for a while, you may have noticed that many of the entries describe how customers have used Carbon technology to solve or identify a problem they were having.  One example of such an entry is Eric's blog about DDRx Memory Controller Selection and Optimization.  In this entry Eric talks about the benefits of using Carbon's DDRx memory solution and why this is a critical component in the virtual prototype. Another common thread has been around CPAKs and their ability to reduce the time customers spend in getting the initial prototype up and running.  Pareena wrote an excellent entry about booting Linux on the ARM® Cortex™-A15.

What if you could combine the benefits talked about in these previous blog entries? What opportunity or opportunities would that open up to you?   Today I will talk about how Carbon's solution is addressing this for our customers worldwide. 

Advanced Performance Optimization 

During this phase, customers will typically want to run a set of benchmarks on top of an operating system.  An example of this might be Dhrystone, CoreMark or tiobench on top of Linux.  For those not familiar, tiobench is a multi-threaded I/O benchmark that is used to measure file system performance.  All of these cases require a significant number of simulation cycles to complete.  Unfortunately, many people come to the conclusion that this use case is not an option for a cycle accurate solution.  Instead they opt for cycle approximate models, which can lead to an inaccurate and un-optimized SoC, or they skip this critical optimization step.  That conclusion could not be further from the truth.  The good news is that you don't have to accept inaccuracy and you don't have to accept skipping this step if you use Carbon's Virtual Prototype solution.

The core pieces of Carbon technology that allow customers to do advanced performance optimization are our integration with ARM Fast Models and Carbon's Swap & Play.  Our integration with the ARM Fast Models allows customers to get increased simulation performance in selected components during periods of time when accuracy isn't critical.  Swap & Play dynamically allows the ARM Fast Model components to be swapped out in favor of cycle accurate components when accuracy is required, i.e. during the benchmark.  Essentially this boils down to performance when you want it and accuracy when you need it.  In the example system below, I started with the Cortex-A15 Linux CPAK.  After booting Linux, I created the Swap & Play checkpoint corresponding to the start of the Dhrystone benchmark.  If you have created many different checkpoints, each representing a different benchmark or interesting point, this is not an issue.  Managing all of these is simple, since SoC Designer Plus provides a checkpoint manager for Swap & Play to organize a user’s checkpoints.
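A quick back-of-the-envelope calculation shows why running the boot on fast models and only the benchmark on cycle accurate models matters so much. The sketch below compares the wall-clock time of a fully cycle accurate run against a hybrid run. Every number in it (instruction counts and simulation speeds) is an assumption invented for illustration; none of them are measured figures from Carbon or ARM.

```python
def run_times_seconds(boot_instr, bench_instr, fast_ips, accurate_ips):
    """Return (all_accurate, hybrid) wall-clock estimates in seconds.

    boot_instr / bench_instr: instructions executed before and during the benchmark
    fast_ips / accurate_ips:  simulated instructions per second for each model type
    """
    all_accurate = (boot_instr + bench_instr) / accurate_ips
    hybrid = boot_instr / fast_ips + bench_instr / accurate_ips
    return all_accurate, hybrid


# Illustrative assumptions only.
BOOT_INSTR = 1_000_000_000    # OS boot up to the start of the benchmark
BENCH_INSTR = 10_000_000      # the benchmark region where accuracy matters
FAST_IPS = 100_000_000        # assumed fast functional model speed
ACCURATE_IPS = 100_000        # assumed cycle accurate model speed

all_accurate, hybrid = run_times_seconds(
    BOOT_INSTR, BENCH_INSTR, FAST_IPS, ACCURATE_IPS)
print(f"fully cycle accurate: {all_accurate:.0f} s")
print(f"swap at benchmark:    {hybrid:.0f} s")
print(f"speedup:              {all_accurate / hybrid:.1f}x")
```

With these made-up inputs almost all of the hybrid run time is spent in the benchmark itself, which is the part where cycle accuracy actually pays for its cost.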


Cortex A15 dhrystone 

After restoring the Swap & Play checkpoint I created in the cycle accurate system, I then complete the simulation running the benchmark.  Below is a screen shot after turning on the profiling features available in Cortex-A15.  Pause for just a moment and think about what is actually being shown here.  We are looking at actual HW events and statistics from running a benchmark on top of an OS with a virtual prototype.  You no longer have to wait for an FPGA of the system to do this level of analysis.


Cache Profile View Cortex A15

Furthermore, without Swap & Play from Carbon and the accuracy of our solution, you will almost certainly either make incorrect architectural tradeoffs or end up with an un-optimized system.  If you were to find this in an FPGA prototype, would you still have enough time in your project schedule to go back and re-validate and verify the architectural change?  This means delays in your time to market and that costs you money.

Of course, one could always just over-engineer the solution, but this will lead to extra power being used and increased size.  Power consumption isn’t just important in the mobile application processor market space.  It is important in all market spaces.  Wasted power means wasted money!

In the next few weeks you will be learning about advances Carbon has made in the Virtual Prototyping methodology with additional CPAKs and additional system simulation, profiling and characterization capabilities.  To learn more about how Carbon's solutions can help you with advanced performance optimizations or to learn more about booting an operating system with a virtual prototype please click below.


Achieving Speed and Accuracy with an ARM Virtual Prototype

Carbon exhibited last week at ARM® TechCon™ in Santa Clara. True to form, it was a very successful show for us and generated a lot of interest at our booth. I also gave a talk together with Rob Kaye from ARM about how to create an ARM virtual prototype which has both speed and accuracy. (You can view a copy of that presentation here or get it in whitepaper form.) The talk was well attended, with roughly 60 attendees who had far more questions than we had time for at the end. (You can still ask them, by the way: post them as comments on this blog and I'll respond.)

Rob and I talked about the tradeoffs that seem to face all virtual prototype design teams: do you create a virtual prototype which runs fast and sacrifices cycle accuracy or do you create one which is cycle accurate but lacks the speed to develop software? The answer to the question is, of course, yes.  If you do it right, you can create a single virtual prototype that is both fast and accurate.  

Virtual Prototype Model Abstraction Graph

The traditional approach to get a fast, accurate virtual prototype is to compromise a bit of speed and a bit of accuracy.  This way you have a model which theoretically has the best attributes of both and none of the downsides.  In reality, the approach that would seem to please everyone typically upsets everyone instead.  It's too slow for use by software teams and too inaccurate for use by architects and firmware engineers.

This isn't to say that the approximately timed (AT) approach hasn't been tried for IP models. Look back four or five years and you would see AT models available from both ARM and MIPS. As model complexity grew, however, both companies abandoned the creation of AT models and instead began offering fast functional models (such as ARM's Fast Models), which make no attempt to model cycle accuracy but have enough functional accuracy to enable software binaries to run on them without modification. Both ARM and MIPS then partnered with Carbon Design Systems to offer cycle accurate models of their processor IP. If you take a look at the ARM IP section on Carbon's IP Exchange web portal, you'll see that we offer cycle accurate models of all of ARM's currently available processor models and also provide demo copies of the corresponding ARM Fast Models when they're available. (The recently announced Cortex-A57 and Cortex-A53 aren't there yet, but availability will be announced after RTL is available from ARM.)

Linux Booting on

Since Carbon obviously works closely with ARM during the creation of our IP models, we spend a good deal of effort to make our cycle accurate models interchangeable with the corresponding Fast Models. This integration works well enough that you only need to create your virtual prototype once in SoC Designer Plus, regardless of the level of abstraction at which the prototype will run. SoC Designer Plus understands the mapping between Carbon's accurate models of ARM IP and the corresponding ARM Fast Models and will automatically create the fast system representation from the accurate one. This way, you're not duplicating design and validation efforts by creating separate virtual prototypes. Create the system once and the unified virtual prototype can then be used by software engineers and architects alike.

Having a single virtual prototype which can run at different speeds and accuracy levels is a great feature, but SoC Designer Plus can take this one step further and enable the virtual prototype to begin running on Fast Models and then switch over to the 100% accurate representation at any breakpoint. Now a user doesn't need to wait hours (or possibly even days) for a cycle accurate virtual prototype to boot Linux or Android and get to a point of interest to start debugging or system analysis. Instead, you can get to the same point in seconds using Fast Models and then switch over to a 100% accurate representation to continue execution. This technology, which we call Swap & Play, isn't limited to a single checkpoint either. You can create multiple checkpoints to enable separate problems to be analyzed or different drivers to be developed.
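To make the idea concrete, here is a toy sketch in Python of the general pattern: run on a fast functional model up to a breakpoint, save a checkpoint of the architectural state, then restore that checkpoint into an accurate model that also counts cycles. This is only an illustration of the concept. It is not Carbon's implementation and does not use the SoC Designer Plus or ARM Fast Models APIs. Every name in it (`ArchState`, `FastModel`, `AccurateModel`, `swap_and_play`), the two-instruction toy ISA, and the cycle costs are made up for the example.

```python
import copy
from dataclasses import dataclass, field

# Made-up per-instruction cycle costs for the toy "accurate" model.
CYCLE_COST = {"add": 1, "mul": 3}

# Toy program: (opcode, register index, immediate operand).
PROGRAM = [("add", 0, 5), ("mul", 0, 3), ("add", 1, 2), ("mul", 1, 4), ("add", 0, 1)]


@dataclass
class ArchState:
    """Architectural state that both model types agree on (hypothetical)."""
    pc: int = 0
    regs: list = field(default_factory=lambda: [0, 0])


def execute(state, program):
    """Apply the instruction at state.pc, advance pc, and return its opcode."""
    op, reg, imm = program[state.pc]
    if op == "add":
        state.regs[reg] += imm
    elif op == "mul":
        state.regs[reg] *= imm
    else:
        raise ValueError(f"unknown opcode: {op}")
    state.pc += 1
    return op


class FastModel:
    """Functional model: correct results, no notion of cycles."""
    def __init__(self, state):
        self.state = state

    def step(self, program):
        execute(self.state, program)


class AccurateModel:
    """Same functional behavior, plus a cycle count per instruction."""
    def __init__(self, state):
        self.state = state
        self.cycles = 0

    def step(self, program):
        self.cycles += CYCLE_COST[execute(self.state, program)]


def run_until(model, program, stop_pc):
    """Step the model until its program counter reaches stop_pc."""
    while model.state.pc < stop_pc:
        model.step(program)


def swap_and_play(program, breakpoint_pc):
    """Run fast to the breakpoint, checkpoint, then continue accurately."""
    fast = FastModel(ArchState())
    run_until(fast, program, breakpoint_pc)
    checkpoint = copy.deepcopy(fast.state)  # saved architectural state
    accurate = AccurateModel(copy.deepcopy(checkpoint))  # restore into accurate model
    run_until(accurate, program, len(program))
    return accurate


if __name__ == "__main__":
    result = swap_and_play(PROGRAM, breakpoint_pc=3)
    print(result.state.regs, result.cycles)
```

In this sketch the final register values match what a fully accurate run would produce, while cycles are only counted for the instructions executed after the checkpoint. Because the checkpoint is copied before being restored, the same saved state can be reused for several accurate runs.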

This is not a new development. Carbon has been shipping Swap & Play for a while now. I've blogged about it before (here and here), published an article about it in ARM IQ, and our own Carbon Performance Analysis Kits (CPAKs) for the Cortex-A9, Cortex-A15 and Cortex-A7 all contain support for it (as will those for future ARM processors). This Carbon-exclusive technology has enabled our customers to solve problems with virtual prototypes that would previously have required expensive hardware prototype solutions. Carbon's unique Swap & Play technology is enabling users to have speed when they want it and accuracy when they need it.

