Tips, Ideas, Discussion for SOC Virtual Prototypes

Sometimes Hardware Details Matter in ARM Embedded Systems Programming

Posted by Jason Andrews on Fri, Jun 20, 2014 @ 04:34 AM

Last week, I received the call for papers for the Embedded World Conference for 2015. The list of topics is a good reminder of how broad the world of embedded systems is. It also reminded me how overloaded the term “embedded” has become. The term may invoke thoughts of a system made for a specific purpose to perform a dedicated function, or visions of invisible processors and software hidden in a product like a car. When I think of embedded, I tend to think about the combination of hardware and software and learning how they work together, and the challenge of building and debugging a system running software that interacts with hardware. Some people call this hardware dependent software, firmware, or device drivers. Whatever it is called, it’s always a challenge to construct and debug both hardware and software and find out what the problems are. One of the great things about working at Carbon is the variety of the latest ARM IP combined with a spectrum of different types of software. We commonly work with software ranging from small bare-metal C programs to Linux running on multiple ARM cores. We also work with a mix of cycle accurate models and abstract models.

If you are interested in this area I would encourage you to learn as much as possible about the topics below. Amazingly, the most popular programming language is still C, and being able to read assembly language also helps.

  • Cross Compilers and Debuggers
  • CPU Register Set
  • Instruction Pipeline
  • Cache
  • Interrupts and Interrupt Handlers
  • Timers
  • Co-Processors
  • Bus Protocols
  • Performance Monitors

I could write articles about how project X at company Y used Carbon products to optimize system performance or shrink time to market and lived happily ever after, but I prefer to write about what users can learn from virtual prototypes. For me, finding out new things through hands-on experience is the exciting part of embedded systems.

Today, I will provide two examples of what working with embedded systems is all about. The first demonstrates why embedded systems programming is different from general purpose C programming because working with hardware requires paying attention to extended details. The second example relates to a question many people at Carbon are frequently asked, “Why are accurate models important?” Carbon has become the standard for simulation with accurate models of ARM IP, but it’s not always easy to see why or when the additional accuracy makes a difference, especially for software development. Since some software development tasks can be done with abstract models, I will share a situation where accuracy makes a difference. Both of the examples in this article looked perfectly fine on the surface, but didn’t actually work.

GIC-400 Programming Example

Recently, I was working with some software that had been used on an ARM Cortex-A9 system. I ported it to a Cortex-A15 system, and was working on running it on a new system that used the GIC-400 instead of the internal GIC of the A15.

People that have worked with me know I have two rules for system debugging:

  1. Nothing ever works the first time
  2. When things don’t work, guessing is not allowed

When I ran the new system with the external GIC-400, the software failed to start up correctly. One of the challenges in debugging such problems is that the software jumps off to bad places after things don’t work and there is little or no trail of when the software went off the path. Normally, I try to use software breakpoints to close in on the problem. Another technique is to use the Carbon Analyzer to trace bus transactions and software execution to spot a wrong turn. In this particular case I was able to spot an abort and I traced it to a normal looking access to one of the GIC-400 registers.

I was able to find the instruction that was causing the abort. The challenge was that it looked perfectly fine. It was a read of the GIC Distributor Control Register to see if the GIC is enabled. It’s one of the easiest things that could be done, and would be expected to work fine as long as the GIC is present in the system. Here is the source code:

[Image: C source of the function that reads the GIC Distributor Control Register]

The load instruction which was aborting was the second one in the function, the LDRB:

[Image: disassembly of the function, showing the LDRB instruction]

The puzzling thing was that the instruction looked fine and I was certain I ran this function on other systems containing the Cortex-A9 and Cortex-A15 internal GIC.

After some pondering, I recalled reading that the GIC-400 had some restrictions on access size for specific registers. Sure enough, the aborting instruction was a load byte. It’s not easy to find a clear statement specifying a byte access to this register is bad, but I'm sure it's in the documentation somewhere. I decided it was easier to just re-code the function to create a word access and try again.

There are probably many ways to change the code to avoid the byte read, but I tried the function this way since the enable bit is the only bit used in the register:

[Image: revised C source of the function]

Sure enough, the compiler now generated a load word instruction and it worked as expected.
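The revised function itself is not reproduced in this text. As a minimal sketch of the idea, and assuming the original code read the register through a byte-wide or bitfield view (a common reason for a compiler to emit LDRB), one way to ask for a 32-bit access is to read through a volatile word pointer and mask the enable bit in software. The names `gicd_ctlr_enabled` and `GICD_CTLR_ENABLE` are illustrative and not taken from the original code; the register address is passed in as a parameter so the sketch can also be tried on a host machine.

```c
#include <stdint.h>

/* Bit 0 of the GIC Distributor Control Register is the enable bit
 * (the article notes it is the only bit this code uses). */
#define GICD_CTLR_ENABLE 0x1u

/* Hypothetical helper: reads the whole 32-bit register through a
 * volatile word pointer, then masks in software, so the compiler
 * has no reason to narrow the load to a byte. */
static inline uint32_t gicd_ctlr_enabled(const volatile uint32_t *gicd_ctlr)
{
    uint32_t value = *gicd_ctlr;   /* one word-sized load */
    return value & GICD_CTLR_ENABLE;
}
```

On the target, the argument would be the address of the control register at offset 0x000 from the distributor base. Checking the disassembly for LDR rather than LDRB remains the real confirmation.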

This example demonstrates a few principles of embedded systems. The first is that the ability to understand ARM assembly language is a big help in debugging, especially when tracing loads and stores to hardware such as the GIC-400. Another is that the code a C compiler generates sometimes matters. Most of the time when using C there is no need to look at the generated code, but in this case there is a connection between the C code and how the hardware responds to the generated instructions. Understanding how to modify the C code to generate different instructions was needed to solve the problem.

Mysterious Interrupt Handler

The next example demonstrates another situation where details matter. This was a bare-metal software program installing an interrupt handler for the Cortex-A15 processor for the nIRQ interrupt by putting a jump to the address of the handler at address 0x18. This occurs during program startup by writing an instruction into memory which will jump to the C function (irq_handler) to handle the interrupt. The important code looked like this (VECTOR_BASE is 0):

[Image: C code that writes the jump instruction at VECTOR_BASE + 0x18]
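The original listing is not reproduced in this text, so the following is only a sketch of one common way to do what the paragraph describes: encode an ARM `B` instruction and store it in the IRQ vector slot. It assumes the vector entry is a direct branch (an `LDR pc, [pc, #offset]` entry is an equally plausible choice) and that the handler lies within the branch range of roughly 32 MB either way. `VECTOR_BASE` and `irq_handler` come from the article; `arm_branch_opcode` and `install_irq_vector` are hypothetical names.

```c
#include <stdint.h>

#define IRQ_VECTOR_OFFSET 0x18u

/* Encode an unconditional ARM (A32) branch located at 'from' that
 * jumps to 'to'. The PC reads as 'from + 8' when the branch executes,
 * and the 24-bit immediate counts words. */
static uint32_t arm_branch_opcode(uint32_t from, uint32_t to)
{
    uint32_t offset_words = (to - from - 8u) >> 2;
    return 0xEA000000u | (offset_words & 0x00FFFFFFu);
}

/* Store the branch into the IRQ slot of a vector table. 'table' is a
 * pointer to the table in memory; 'table_addr' is the address the CPU
 * sees for it (VECTOR_BASE, which is 0 in the article). */
static void install_irq_vector(volatile uint32_t *table, uint32_t table_addr,
                               uint32_t handler_addr)
{
    table[IRQ_VECTOR_OFFSET / 4u] =
        arm_branch_opcode(table_addr + IRQ_VECTOR_OFFSET, handler_addr);
}
```

Note that this store goes through the data side of the memory system, which is exactly what sets up the problem described next.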

The code looked perfectly fine and worked when simulated with abstract models, but didn’t work as expected when run on a cycle accurate simulation. Initially, it was very hard to tell why. The simulation would appear to just hang, and when the simulation was stopped it was sitting in weird places that didn’t seem like code that should have been running. Using the instruction and transaction traces it looked like an interrupt was occurring, but the program didn’t go to the interrupt handler as expected. To debug, I first placed a hardware breakpoint on a change on the interrupt signal, then I placed a software breakpoint on address 0x18 so the simulation would stop when the first interrupt occurred. The expected instruction was there, but when I single stepped to the next instruction the PC just advanced one word to address 0x1c, with no jump. Subsequent step commands just incremented the PC. In this case there was no code at any other address except 0x18 so the CPU was executing instructions that were all 0.

This problem was pretty mysterious considering the debugger showed the proper instruction at the right place, but it was as if it wasn’t there at all. Finally, it hit me that the only possible explanation was that the instruction really wasn’t there.

What if the cache line containing address 0x18 was already in the instruction cache when the jump instruction was written by the above code? When the interrupt occurred the PC would jump to 0x18, but the CPU would fetch the stale value from the instruction cache and never see the new value that had been written.

The solution was to invalidate the cache line after writing the instruction to memory using a system control register instruction with 0x18 in r0:

[Image: instruction that invalidates the instruction cache line]
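The fix in the article is a single system control register operation written in assembly. For code kept in C and built with GCC or Clang, a hedged alternative is the `__builtin___clear_cache` builtin, which asks the toolchain to make newly written instructions visible to instruction fetch. This is a different mechanism from the one the article used, and on a bare-metal target the builtin may lower to a call to `__clear_cache` that the runtime has to provide, so treat it as an assumption to verify for your toolchain. `patch_instruction` is a hypothetical name.

```c
#include <stdint.h>

/* Write one instruction word and then synchronize the instruction
 * stream for that address range. On hosts that need no explicit
 * maintenance for this, the builtin compiles to nothing. */
static void patch_instruction(uint32_t *slot, uint32_t opcode)
{
    *slot = opcode;
    __builtin___clear_cache((char *)slot, (char *)(slot + 1));
}
```

Whichever mechanism is used, the point is the same as in the article: the write and the cache maintenance belong together whenever code is written into memory at run time.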

Although cache details are mostly handled automatically by hardware and cache modelling is not always required for software development, this example shows that sometimes more detailed models are required to fully test software. In hindsight experienced engineers would recognize self-modifying code, and the need to pay attention to caching, but it does demonstrate a situation where using detailed models does matter.

Summary

Although you may never encounter the exact problems described here, they demonstrate typical challenges embedded systems engineers face, and remind us to keep watch for hardware details. These examples also point out another key principle of embedded software: old code lives forever. This often means that while code may have worked on one system, it won’t automatically work on a new system, even if they seem similar. If these examples sound familiar, it might be time to look into virtual prototypes for your embedded software development.

Jason Andrews

       

Tags: ARM Models, virtual prototype, cycle accurate models, ARM Cortex-A15, Jason Andrews, ARM Virtual Prototype, ARM, ARM GIC-400

Understanding ARM Bare Metal Benchmark Application Startup Time

Posted by Jason Andrews on Wed, May 21, 2014 @ 10:16 AM

One of the benefits of simulation with virtual prototypes is the added control and visibility of memory. It’s easy to load data into memory models, dump data from memory to a file, and change memory values without running any simulation. After I gave up debugging hardware in a lab and decided I would rather spend my time simulating, some of my first lessons were related to assumptions software makes about memory at power on. When a computer powers on, software must assume the data in memories such as SRAM and DRAM is unknown. I recall being somewhat amazed to find out that initialization software would commonly clear large memory ranges before doing anything useful. I also recall learning that startup software would figure out how much memory was present in a system by writing the first byte or word to some value, reading it back, and if the written value was read back there must be memory present. The way to determine memory size was to continue incrementing the address until the read value did not match the expected value and conclude this was the size of the memory.
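The memory sizing scheme described above can be sketched in a few lines of C. This is only an illustration of the write, read back, and increment idea under simplifying assumptions: real startup code also has to deal with address aliasing and bus aborts, and usually steps in larger increments than one word. `probe_memory_size` and its parameters are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Walk upward from 'base' one word at a time, writing a test pattern
 * and reading it back. Stop at the first word that does not hold the
 * pattern, or at 'max_words'. Returns the detected size in bytes. */
static size_t probe_memory_size(volatile uint32_t *base, size_t max_words)
{
    const uint32_t pattern = 0xA5A5A5A5u;
    size_t words = 0;

    while (words < max_words) {
        uint32_t saved = base[words];   /* keep the previous contents */
        base[words] = pattern;
        if (base[words] != pattern) {
            break;                      /* nothing answered: end of memory */
        }
        base[words] = saved;            /* restore before moving on */
        words++;
    }
    return words * sizeof(uint32_t);
}
```

In a simulator this kind of loop is pure overhead, since the memory size is already known from the model configuration.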

Recently, I was working on a bare metal program and simulating execution of an ARM Cortex-A15 system. Carbon Performance Analysis Kits (CPAKs) come with example systems and initialization software to help shorten the ramp up time for users. Generally, people don’t pay much attention to the initialization code unless it’s either broken or they need to configure specific hardware features related to caches, MMU, or VFP and Neon hardware.

Today, I’ll provide some insight into some of the things the initialization code does, specifically what happens between the end of the assembly code which initializes the hardware and the start of a C program.

The program I was running had the following code at the top of main.c

[Image: top of main.c, showing the memspace array declaration]

There is an array named memspace with a #define to set the size of the array. When running new software it’s a good idea to understand as much as possible as quickly as possible by getting through the program the first time. One way to do this is to cut down the number of iterations, data size, or whatever else is needed to complete the program and gain confidence it’s running correctly. This avoids wasting time thinking the program is running correctly when it’s not. I normally put a few breakpoints in the software and just feel my way through the program to see where it goes and how it runs.

I like to put a breakpoint at the end of the initial assembly code to make sure nothing has gone wrong with the basic setup. Next, I like to put a breakpoint at main() to make sure the program gets started, and then stop at interesting looking C functions to track progress. For this particular program I shrank the size of the memspace array to 200 bytes for the first pass through the test.

After I understood the basics of the program, I put the array size back to the original value of 200000 bytes. When I did this I noticed a strange phenomenon. The simulation took much longer to get to main() when the array was larger, about 8 times longer as shown in the table below.

Array size (bytes)    Cycles to reach main()
               200                      4860
            200000                     39174

One of the purposes of this article is to shed some light on what happens between the end of the startup assembly code and main(). Obviously, there is something related to the size of the memory array that influences this section of code.

Readers that have the A15 Bare Metal CPAK can follow along with similar code to what I used for the benchmark by looking in Applications/Dhrystone/Init.s

There are two parts to jumping from the initial assembly code to the main() function in C. First, save the address of __main in r12 as shown below.

[Image: Init.s code that saves the address of __main in r12]

Next, jump to __main at the end of the assembly code by using the BX r12 instruction. After the BX instruction the program goes into a section of code provided by the compiler (for which there is no source to debug) but if all goes well it comes out again at main().

The code starting from __main performs the following tasks:

  • Copies the execution regions from their load addresses to their execution addresses. This is a memory setup task for the case where the code is not loaded in the location it will run from or if the code is compressed and needs to be decompressed.
  • Zeros memory that needs to be cleared based on the C standard that says statically-allocated objects without explicit initializers are initialized to zero.
  • Branches to __rt_entry
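The first two tasks can be pictured with a rough C equivalent. This is not the ARM library code (for which, as noted, there is no source); it is a sketch under the assumption that the copy and the clear behave like `memcpy` and `memset` over regions described by the linker. All function and parameter names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Rough C equivalent of the first two tasks: copy initialized (RW)
 * data from its load address to its execution address, then clear the
 * zero-initialized (ZI) region. The cost of the second step grows
 * with zi_len, which is why a larger array delays main(). */
static void init_memory_regions(uint8_t *rw_exec, const uint8_t *rw_load,
                                size_t rw_len, uint8_t *zi_start,
                                size_t zi_len)
{
    if (rw_exec != rw_load) {
        memcpy(rw_exec, rw_load, rw_len);   /* skipped when already in place */
    }
    memset(zi_start, 0, zi_len);
}
```

Seen this way, the time spent before main() depends on data layout rather than on anything the program does explicitly.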

Once the memory is ready, the code starting from __rt_entry sets up the runtime environment by doing the following tasks:

  • Sets up the stack and heap
  • Initializes library functions
  • Calls main()
  • Calls exit() after main() completes

If anything goes wrong between the assembly code and main() the most common cause is the stack and heap setup. I always recommend taking a look at this if your program doesn’t make it to main().

You may have guessed by now that the simulation time difference I described is caused by the time required to zero a larger array compared to a smaller array. As I mentioned at the start of the article, writing zero to large blocks of memory that is already zero (or can easily be made zero using a simulator command) is a waste of time. Carbon memory models already initialize memory contents to zero by default. Some people prefer a more pessimistic approach and initialize memory to some non-zero value to make sure the code will work on the real hardware, but for users more interested in performance analysis it seems helpful to avoid wasted simulation cycles and get on to the interesting work.

The Linux size command is a good way to confirm the larger array impacts the bss section of the code. The zero initialized (ZI) data and bss refer to the same segment. With the 200 byte array:

-bash-3.2$ size main.axf

   text        data         bss         dec         hex     filename

  71276          16     721432     792724       c1894     main.axf

With the 200000 byte array:

-bash-3.2$ size main.axf

   text        data         bss         dec         hex     filename

  71276          16     921232     992524       f250c     main.axf
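As a quick cross-check of the numbers, the growth in bss can be compared with the growth in cycles to reach main(). The sketch below uses only figures quoted in this article; the identifiers are hypothetical. The bss difference is 199800 bytes, which is exactly 200000 minus 200, and the marginal cost works out to a little over 0.17 cycles per added byte. That is a marginal figure for the extra array bytes only and should not be read as a cost for every byte that `size` reports as bss.

```c
#include <stdint.h>

/* Figures copied from the article's table and 'size' output. */
enum {
    BSS_SMALL = 721432, BSS_LARGE = 921232,
    CYCLES_SMALL = 4860, CYCLES_LARGE = 39174
};

/* Extra bss bytes introduced by the larger array. */
static int32_t extra_bss_bytes(void)
{
    return BSS_LARGE - BSS_SMALL;
}

/* Extra cycles spent before main() per extra bss byte. */
static double extra_cycles_per_byte(void)
{
    return (double)(CYCLES_LARGE - CYCLES_SMALL) / (double)extra_bss_bytes();
}
```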

Alternatives to Save Simulation Time

There are multiple ways to avoid executing instructions to write zero to memory that is already zero. It turns out to be a popular question. Search for something like “avoid bss static object zero”.

One way is to use linker scripts or compiler directives to put the array into a different section of memory that is not automatically initialized to 0.

For the program I was working on I decided to investigate a couple of alternatives.

One solution is to just skip __main altogether and go directly to __rt_entry since __main doesn’t do anything useful for this program and the change is simple.

To skip __main just replace the load of __main into r12 with a load of __rt_entry into r12. Now when the program runs __main will be skipped altogether.

[Image: Init.s code that loads the address of __rt_entry into r12]

Here are the new results with __main skipped.

Array size (bytes)    Cycles to reach main()
               200                      4355
            200000                      4362

As expected the number of cycles to reach main() is about the same with both array sizes, and much less than zeroing the large array. Although the difference may seem small for the benchmark I have shown here, the problem gets much bigger when a larger and more complex software program is run. I checked a larger software program and found it was taking more than 10 million instructions to zero memory.

I wouldn't recommend just blindly applying this technique, especially on larger software programs, as debugging improperly initialized global variables is not fun.

Another possibility to avoid initializing large global variables is to use a compiler pragma. The ARM compiler, armcc, has a section pragma to move the large array into a section which is not automatically initialized to zero. To use it, put the pragma around the array declaration as shown below.

[Image: memspace array declaration wrapped in the armcc section pragma]
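The original listing is not reproduced in this text. A sketch of what such a declaration can look like with armcc follows, assuming the `#pragma arm section zidata` form and a section name of `non_initialized`; the section name, the `MEMSPACE_SIZE` macro, and the element type are illustrative choices rather than details from the original program. The pragmas are guarded with `__CC_ARM`, which armcc predefines, so other compilers simply see a plain array.

```c
/* Hypothetical size macro; the article only says the array is called
 * memspace and holds 200000 bytes. */
#define MEMSPACE_SIZE 200000

#ifdef __CC_ARM   /* defined by armcc; other compilers skip the pragmas */
#pragma arm section zidata = "non_initialized"
#endif
unsigned char memspace[MEMSPACE_SIZE];
#ifdef __CC_ARM
#pragma arm section zidata   /* restore the default ZI section */
#endif
```

The pragma only names the section. Whether the data is left untouched at startup is decided by how the linker places that section, which is the next step.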

After putting in the pragma, one more step is needed. The scatter file for the linker must be aware of this new section. More information is available in the documentation under "How to prevent uninitialized data from being initialized to zero".

In my linker scatter file I added one more section:

[Image: additional section in the linker scatter file]

Executing the program with the pragma is a much safer solution, especially when the software is going to write the memory anyway and the initial zero values are not being assumed. With the pragma the number of cycles to reach main is the same with both sizes of the array.

Array size (bytes)    Cycles to reach main()
               200                      4411
            200000                      4411

The pragma is a good solution if there are a few large arrays that can be found and instrumented with the pragma. 

Hopefully this article provided some understanding of what happens between the initial assembly code and main(). Although there is no one-size-fits-all solution, it definitely helps to understand and improve application startup time. Next time you find yourself with a program that appears stuck before reaching main() this just might be the cause.

Jason Andrews

 

                               

Tags: virtual prototype, ARM Cortex-A15, CPAK, Virtual Platform, Jason Andrews, ARM, Bare Metal

SMP Linux on a Minimal Dual-Core ARM Cortex-A15 System

Posted by Jason Andrews on Tue, Apr 29, 2014 @ 09:13 AM

Previously, I explained how to create a minimal, single-core ARM® Cortex™-A15 system running Linux®. In this article I will update the hardware design to use a dual-core ARM Cortex-A15 CPU and run SMP (Symmetric Multiprocessing) Linux. While the first system was interesting, I’m fairly certain no single-core A15 systems have ever been built, and most engineers will require multi-core systems for common design and verification tasks.

Changes to the Hardware Design

Two hardware design changes are needed to enable SMP Linux. First, the CPU must be updated to the dual-core A15. This is a matter of simply updating the CPU model to the A15x2 CPU. For those who have not built models using Carbon IP Exchange, the web-based portal that builds models directly from RTL code, there is a simple configuration page used to select the A15 model parameters. The key step is to make sure the “Number of CPU Cores” is set to 2 as shown in the picture below.

[Image: Carbon IP Exchange configuration page for the Cortex-A15]

Once this is configured, and the model is created, it can be instantiated on the SoC Designer canvas in place of the single-core A15 model and connected to the interrupt from the PL011 UART and to the CCI-400 in the same way.

The second change is the addition of an extra memory which is used to communicate the starting address for the secondary core. This memory will take the place of the Versatile Express System Registers. Recall in the previous article I explained that 1 line of source code had to be changed to avoid using a timer value from the System Registers. For SMP Linux there is another register (offset 0x30) which is used to pass the jump address to the secondary CPU. For the minimal system, the only behavior needed is to provide a simple memory at the base address of the System Registers, which is system address 0x1c010000. Only offsets 0x30 and 0x34 are used, and the values must be initialized to 0 because the secondary core waits for a non-zero value. When the secondary CPU sees a non-zero value it will jump to the address contained at 0x1c010030. If the value at this address is not 0 at startup the system will not boot properly. SoC Designer simple memory models initialize to 0 so no special handling is needed.

                                                 

Linux Changes

The Linux image needs to be recompiled with SMP support. This is done using the normal kernel configuration process.

$ make ARCH=arm menuconfig

The option to enable SMP is under the Kernel Features menu. Make sure Symmetric Multi-Processing is selected as shown below.

[Image: menuconfig Kernel Features menu with Symmetric Multi-Processing selected]

After enabling SMP rebuild the kernel as before, adjusting the CROSS_COMPILE to match the prefix of your ARM cross compiler:

$ make ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- -j 4

This will create a new zImage which is ready for the dual-core A15 design. The remainder of the steps to prepare the final software image is the same. For the dual-core case I named the final product a15x2-linux.axf and loaded this file into the simulator.

Starting the Secondary CPU

The main challenge in running SMP Linux is getting the secondary CPU started. After reset, both CPUs will start running the code in boot.S which is located at 0x80000000. The first step is to determine if the code is running on CPU0 or CPU1. This is done by reading the CPU ID register located in co-processor 15 (CP15). This register is also referred to as the Multiprocessor Affinity Register, MPIDR. It provides information about which core of an MPCore processor the code is running on, and which cluster of a multi-cluster system the code is running on. In this case we have only a single cluster and two cores so the code simply identifies CPU ID 0 as the primary core and CPU ID 1 as the secondary.

[Image: boot.S code that reads the CPU ID]
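The boot code itself is assembly and is not reproduced in this text. As a sketch of the decoding step only, the helpers below pull the core and cluster numbers out of an MPIDR value that has already been read (on the target, with `MRC p15, 0, <Rt>, c0, c0, 5`). The function names are hypothetical.

```c
#include <stdint.h>

/* MPIDR affinity fields: Aff0 (bits 7:0) is the core number within a
 * cluster, Aff1 (bits 15:8) is the cluster number. */
static inline uint32_t mpidr_core_id(uint32_t mpidr)
{
    return mpidr & 0xFFu;
}

static inline uint32_t mpidr_cluster_id(uint32_t mpidr)
{
    return (mpidr >> 8) & 0xFFu;
}

/* The boot code in the article treats core 0 as the primary. */
static inline int is_primary_core(uint32_t mpidr)
{
    return mpidr_core_id(mpidr) == 0u;
}
```

With a single cluster and two cores, only the lowest bit of the core number ever differs between the two CPUs.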

The primary core finishes the boot loader and immediately starts running Linux, while the secondary core waits in the boot loader for a jump address to be provided at address 0x1c010030.

The picture below shows the code for the secondary CPU.

[Image: boot.S code for the secondary CPU]

The primary CPU, which is running Linux, is responsible for releasing the secondary CPU by writing the jump address and sending an interrupt. In the Linux 3.13.1 kernel, this is found in arch/arm/mach-vexpress/platsmp.c at line 225. Putting a breakpoint on this line of code and single stepping will reveal the details. The underlying code will write the jump address and take care of all the details to start the secondary CPU. The well commented last line in the screenshot below gives the details.

[Image: platsmp.c code that starts the secondary CPU]

Below is a screenshot of the memory contents for the System Registers address range. The primary CPU actually wrote 0xffffffff into address 0x1c010034 and then the 32-bit jump address into 0x1c010030. Because this is a memory view from the perspective of the CPU, I entered the virtual address into the memory viewer window, which is 0xf8010000. 

[Image: memory view of the System Registers address range]
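The handshake between the two CPUs can be modelled in a few lines of C. This is a sketch of the protocol as described in this article, not the kernel or boot loader source: the primary writes 0xffffffff at offset 0x34 and the jump address at offset 0x30, and the secondary waits until offset 0x30 is non-zero. The macro and function names are illustrative, and the interrupt that wakes the secondary is left out.

```c
#include <stdint.h>

#define SYS_FLAGSSET_OFFSET 0x30u   /* holds the jump address */
#define SYS_FLAGSCLR_OFFSET 0x34u

/* Primary side: mirrors the two writes described above. */
static void release_secondary(volatile uint32_t *sysregs, uint32_t entry)
{
    sysregs[SYS_FLAGSCLR_OFFSET / 4u] = 0xFFFFFFFFu;
    sysregs[SYS_FLAGSSET_OFFSET / 4u] = entry;
}

/* Secondary side: poll until the jump address becomes non-zero.
 * Real boot code would typically sleep in WFI between checks. */
static uint32_t wait_for_entry(const volatile uint32_t *sysregs)
{
    uint32_t entry;
    do {
        entry = sysregs[SYS_FLAGSSET_OFFSET / 4u];
    } while (entry == 0u);
    return entry;
}
```

Because the minimal system backs these offsets with plain memory, both writes simply store their values, which matches what the memory view shows.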

If everything works correctly, there should be some messages in the boot log showing that 2 CPUs are running. 

[Image: boot log showing 2 CPUs brought up]

As a final check on the platform, I booted the SMP Linux and created a Swap & Play checkpoint. I restored the checkpoint into a cycle-accurate simulation and issued the command:

$ cat /proc/cpuinfo                   

The terminal output confirms the state of the simulation was successfully transferred to the cycle accurate simulation and both cores are reported to be running. 

[Image: terminal output of cat /proc/cpuinfo]

Summary

In summary, with the addition of an extra memory model in the location of the Versatile Express System Registers and modifying the Linux kernel configuration parameter to enable SMP we are up and running with a dual-core ARM Cortex-A15 minimal hardware design. This design supports all of the same Swap & Play and benchmarking features as the single-core design and can be extended to add more hardware detail for specific design tasks as needed.

As before, all of the work that I’ve done here to prepare Linux and the hardware design is available as a CPAK on Carbon IP Exchange in the Cortex-A15 CPAK area.

Jason Andrews

         

Tags: Virtual Reference Platform, virtual prototype, ARM Cortex-A15, SoCDesigner Plus, Virtual Platform, Jason Andrews, Embedded Systems Development, Linux

Running the Latest Linux Kernel on a Minimal ARM Cortex-A15 System

Posted by Jason Andrews on Tue, Apr 15, 2014 @ 08:06 AM

Linux® has become a popular operating system in embedded products, and as a result is experiencing increased usage for system design and verification. This means that a new set of people, who are not traditional embedded software engineers, need to learn to work with Linux to accomplish daily tasks.

For system design tasks, Carbon users generally proceed from bare-metal software benchmarks such as Dhrystone and CoreMark to running benchmark applications on an operating system. If the design uses ARM® Cortex™-A series processors, the most common OS choice is Linux. Linux has come a long way with respect to ARM Architecture since Linus Torvalds made his famous complaint about ARM support back in March, 2011. The Linux Device Tree has made it very easy for those of us who work with simulated hardware, and often partial systems, to be able to run Linux with almost no changes to the kernel source code.

As I mentioned in my previous blog, I recently joined Carbon.  I decided the best way to learn would be to use the tools the way a customer would.  I’ll talk below about the steps I took to get Linux running on a minimal system using the ARM Cortex-A15 processor.  If you’re using a Cortex-A15, you’ll get even more benefit as this platform is available for download as a Cortex-A15 CPAK.

Why a Minimal System?

It’s easier to start with a small system and iteratively add complexity: add peripherals, enable drivers, and verify at each phase that the system functions as expected. Starting off with a larger system introduces the difficulty of extracting hardware; it can be tricky to identify the software that relies on the hardware you want to remove.

It has also been my observation that many engineers designing and using leading edge SoCs are keenly interested in the performance details of key parts of the chip such as the processors, interconnect, memory controller, and GPU. These processor sub-system architects don’t always want or need a lot of slower peripherals that are typically assumed to be present in systems running Linux.

Initial Challenges

The primary goal is to identify the minimum hardware needed to run Linux on an ARM Cortex-A15. It should be possible to run Linux with nothing more than the A15 CPU, memory, and a UART, so I set out to try it.

A secondary goal is to use the latest kernel from kernel.org and change as little of the Linux source code as possible. Minimizing source code changes makes it easier to update to new versions of Linux as they are released.

Methodology

Simulation speed is far more important than hardware accuracy when experimenting with Linux configurations to confirm a working kernel. This is an ideal situation in which to generate an ARM Fast Model design from the Carbon SoC Designer Plus canvas. After the kernel is working with the Fast Model design, it is easy to run the Carbon cycle accurate simulator, sdsim, for benchmarking and utilize Swap & Play to confirm a fully working virtual prototype that is both fast and accurate.

To meet the goal of changing as little kernel source code as possible, I started from a currently-supported platform, the ARM Versatile™ Express, then configured the kernel to use only the minimal hardware. New kernel versions will continue to support Versatile Express, making future upgrades easy to do.

Hardware Design

I decided to call my new hardware design the a15mini to indicate a Cortex-A15 system with minimal hardware. The Versatile Express design requires specifying the memory map and interrupt connections. The memory for the Cortex-A series Versatile Express is from 0x80000000 to 0xffffffff. The first PL011 UART is located at 0x1c090000 and uses interrupt 5 on the A15 IRQS[n:0] input request lines connected to the internal Generic Interrupt Controller (GIC). The Cortex-A15 needs to have the base address for the internal memory mapped peripherals (PERIPHBASE) set to 0x2c000000. The only other relevant information is that the UART runs from a 24 MHz reference clock.
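The numbers in this paragraph can be collected into a small C header-style sketch, together with two calculations that follow from them. The macro and function names are hypothetical. The mapping from IRQS[5] to GIC interrupt ID 37 relies on shared peripheral interrupts starting at ID 32, and the baud rate of 115200 used in the usage note is an assumption for illustration, not a value from the article.

```c
#include <stdint.h>

/* Values stated in the article for the a15mini. */
#define A15MINI_DRAM_BASE   0x80000000u
#define A15MINI_UART0_BASE  0x1C090000u
#define A15MINI_PERIPHBASE  0x2C000000u
#define A15MINI_UART_CLK_HZ 24000000u
#define A15MINI_UART0_SPI   5u          /* IRQS[5] */

/* Shared peripheral interrupts start at GIC interrupt ID 32. */
static inline uint32_t spi_to_gic_id(uint32_t spi)
{
    return 32u + spi;
}

/* PL011 baud divisor = clk / (16 * baud), held as a 16-bit integer
 * part and a 6-bit fraction. Compute it scaled by 64 and rounded. */
static inline uint32_t pl011_divisor_x64(uint32_t clk_hz, uint32_t baud)
{
    return (4u * clk_hz + baud / 2u) / baud;
}

static inline uint32_t pl011_ibrd(uint32_t clk_hz, uint32_t baud)
{
    return pl011_divisor_x64(clk_hz, baud) >> 6;
}

static inline uint32_t pl011_fbrd(uint32_t clk_hz, uint32_t baud)
{
    return pl011_divisor_x64(clk_hz, baud) & 0x3Fu;
}
```

For the 24 MHz reference clock and an assumed 115200 baud, the integer divisor is 13 and the fractional divisor is 1.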

Creating the a15mini with SoC Designer consists of instantiating the models and connecting them on the canvas using sdcanvas. Using cycle accurate models means more detail is needed to create the design. Instead of just the CPU, simple address decoder, and memory I used the ARM CCI-400 Cache Coherent Interconnect and the NIC-301 interconnect. These models, along with the Cortex-A15, are built using Carbon IP Exchange, the web-based portal that builds models directly from ARM RTL code. It’s pretty amazing to think that I answer a few questions or submit an XML file from AMBA Designer and get back a simple to use model in the form of a .so file and know that it was generated from a very complex RTL design of a processor like the Cortex-A15. It’s as if I’m using millions of lines of Verilog code and I never see any of it.

The design is shown below:

[Image: ARM Cortex-A15 Minimal Linux CPAK screenshot]

Linux Preparation

Creating a Linux image for the a15mini is a little more complex than the hardware design procedure.  

There are many ways to prepare Linux, but at a minimum the following items are needed:

  • Boot Loader
  • Kernel Image
  • Device Tree Blob
  • File System

For this experiment I decided to make things as easy to work with as possible. The most straightforward way was to use a single executable file (ELF file) containing all of the above items; anybody who wants to run the platform needs only this one file to represent all of the artifacts. A drawback of this approach is that the file must be regenerated when any of the items changes, but it creates a generic solution that can be run on any kind of simulator.

Kernel Image

I downloaded Linux 3.13.1 from kernel.org as the starting point. This was the latest kernel at the time I did the initial simulation, but new versions are released frequently.

Kernel Configuration

The default configuration for the Versatile Express is found in the Linux source tree at: arch/arm/configs/vexpress_defconfig

First, I use the Versatile Express configuration as the baseline by running:

$ make ARCH=arm vexpress_defconfig

File System

To get up and running quickly, I cheated - I copied the file new-buildroot-rootfs.cpio.gz from the Carbon ARM Cortex-A9 Linux CPAK and renamed it fs.cpio.gz. (A future article may cover the various ways to make file system images.)

Customizing Kernel Configuration

To create the single executable file with all of the needed artifacts, I needed to embed the file system image in the kernel, and append the Device Tree Blob at the end of the kernel image.

To embed the file system image in the kernel, you can use any of the Linux configuration interfaces. I tend to use menuconfig:

$ make ARCH=arm menuconfig

I navigated to the General Setup menu (see the image below), scrolled down to “Initramfs sources file(s)” and added the name of the file system image, fs.cpio.gz. I put this file at the top of the Linux source tree so no additional path is needed.

fs config

To append the Device Tree Blob at the end of the kernel image, access the Boot options menu item “Use appended device tree blob to zImage (EXPERIMENTAL).” I enabled this to append the .dtb file at the end of the zImage file (shown in the image below).

append dtb config

While I was in the Boot options menu I also set the Default kernel command string by adding root=/dev/ram and earlyprintk. This tells the kernel to use a RAM-based root file system, the only possible choice since no other storage is included in the hardware design. There are many ways to set the default kernel command string, but this approach works well for this application, in which we want to link everything into a single ELF file.
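A quick way to confirm that the menuconfig choices above actually landed in .config is a small check like the one below. It is only a sketch: the option names CONFIG_INITRAMFS_SOURCE, CONFIG_ARM_APPENDED_DTB and CONFIG_CMDLINE are my assumption of what these menu entries map to in kernels of this era, and the function name check_a15mini_config is made up for illustration.

```shell
# Report whether a kernel .config contains the three settings described above.
# The CONFIG_* names are assumptions, not taken from the original article.
check_a15mini_config() {
    cfg=${1:-.config}
    rc=0
    # These two must match a whole line exactly.
    for want in \
        'CONFIG_INITRAMFS_SOURCE="fs.cpio.gz"' \
        'CONFIG_ARM_APPENDED_DTB=y'
    do
        if grep -qxF "$want" "$cfg"; then
            echo "ok:      $want"
        else
            echo "missing: $want"
            rc=1
        fi
    done
    # The command line may carry other arguments too, so match substrings.
    for arg in 'root=/dev/ram' 'earlyprintk'; do
        if grep '^CONFIG_CMDLINE=' "$cfg" | grep -qF "$arg"; then
            echo "ok:      CONFIG_CMDLINE contains $arg"
        else
            echo "missing: CONFIG_CMDLINE contains $arg"
            rc=1
        fi
    done
    return $rc
}
```

Run it from the top of the kernel source tree as check_a15mini_config .config; a non-zero exit status means at least one setting is absent.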

Source Code Changes

The bad news is I wasn’t able to run the 3.13.1 kernel on the a15mini without any kernel source code changes. The good news is I came within 1 line of the goal. I needed to edit the file arch/arm/mach-vexpress/v2m.c to remove the line that configures the kernel scheduler clock to read a time value from the Versatile Express System Registers (this peripheral has a register which provides a time value).  To achieve my goal of including a minimum of hardware, I wanted to forgo the System Registers.

The line I removed is in the function v2m_dt_init_early() and is line number 423, the call to versatile_sched_clock_init().


Compilation

Now the source tree could be compiled. I work on Ubuntu 13.10 and use the cross-compiler with the GNU prefix arm-linux-gnueabi, so my compile command was:

$ make ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- -j 4

The most important compilation result is the kernel image file arch/arm/boot/zImage.

Device Tree Blob

The Versatile Express device trees for ARM Fast Models are available from linux-arm.org. The easiest way to get the files is to use git:

$ git clone git://linux-arm.org/arm-dts.git

The file in the fast_models/ directory that is the closest match for the a15mini is rtsm_ve-cortex_a15x1.dts

This file also includes a .dtsi file named rtsm_ve-motherboard.dtsi. The main work to support the a15mini hardware system was to modify these two device tree files so they match the hardware available by removing all of the hardware that doesn’t exist.

So first, I edited rtsm_ve-motherboard.dtsi to remove all of the peripherals that are gone from the a15mini such as flash, Ethernet, keyboard/mouse controller, three extra UARTs, watchdog timer, and more. I left the original structure, but shrunk the file all the way down to just the UART!

Architected Timer Support

Linux needs a timer to run. I removed the SP804 timers that are present in the Versatile Express and the System Register block that provides the time value for the scheduler clock. In their place, I wanted to use an internal timer in the A15, sometimes called the Architected timer, to keep the hardware design as small as possible.

Support for the ARM Architected timer is already present in rtsm_ve-cortex_a15x1.dts, but I found that it didn’t work right away when I commented out the call to versatile_sched_clock_init(). The kernel crashed at the point of setting up the Architected timer, and printed a message that there was no frequency available. By looking through other device tree files, I found that some armv7-timer entries had a clock-frequency attached to them, while the Versatile Express entry did not. After adding the clock-frequency to the timer in rtsm_ve-cortex_a15x1.dts, the timer worked, and Linux runs as expected with the internal timer. No external timer is needed!

Here is the timer entry with the clock-frequency added:

timer
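The search through other device tree files can be approximated with grep. The sketch below lists files that mention both the arm,armv7-timer compatible string and a clock-frequency property. It is a heuristic only: it does not prove that the property sits inside the timer node, so each hit still needs a look by eye. The function name and the default directory are assumptions for illustration.

```shell
# List device tree sources under a directory that mention both an ARMv7
# architected timer and a clock-frequency property somewhere in the file.
# Heuristic: the property may belong to a different node in the same file.
find_timers_with_frequency() {
    dir=${1:-arch/arm/boot/dts}
    grep -rl 'arm,armv7-timer' "$dir" | while IFS= read -r f; do
        if grep -q 'clock-frequency' "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Calling find_timers_with_frequency with no argument from the top of a kernel tree searches arch/arm/boot/dts; pass another directory, such as the arm-dts checkout, to search there instead.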

I have attached the two device tree files so you can take a look at them as needed.

Using the Device Tree Compiler

After the two device tree files have been shrunk down to support only the hardware available on the a15mini, they can be compiled into the device tree blob using the device tree compiler, dtc, which is available in the scripts/dtc directory of the kernel source tree.

I did not add anything to my PATH to find dtc; I just reached over into the kernel source tree and ran:

$ ../linux-3.13/scripts/dtc/dtc -O dtb -o rtsm_ve-cortex_a15x1.dtb fast_models/rtsm_ve-cortex_a15x1.dts

Adding Device Tree to Kernel

To use the feature that appends the device tree to the Linux kernel image, I simply copied the zImage from the kernel tree.

$ cp linux-3.13.1/arch/arm/boot/zImage  .

Then I concatenated the device tree blob to the end of the kernel image:

$ cat arm-dts/rtsm_ve-cortex_a15x1.dtb >> zImage
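Because a wrong or empty .dtb only shows up later as a silent boot failure, it can be worth checking the concatenation. The sketch below writes the combined image to a new file instead of appending in place, then confirms that the bytes at the original end of the kernel image are the flattened device tree magic number 0xd00dfeed. The function name append_dtb is hypothetical.

```shell
# Concatenate a kernel image and a device tree blob into a new file, then
# confirm the blob starts exactly where the original image ended.
append_dtb() {
    zimage=$1
    dtb=$2
    out=$3
    if [ ! -r "$zimage" ] || [ ! -r "$dtb" ]; then
        echo "cannot read input files" >&2
        return 1
    fi
    # Arithmetic expansion strips any padding some wc implementations print.
    size=$(($(wc -c < "$zimage")))
    cat "$zimage" "$dtb" > "$out" || return 1
    # A flattened device tree begins with the big-endian magic 0xd00dfeed.
    magic=$(dd if="$out" bs=1 skip="$size" count=4 2>/dev/null | od -An -tx1 | tr -d ' \n')
    if [ "$magic" = "d00dfeed" ]; then
        echo "DTB found at offset $size"
    else
        echo "no DTB magic at offset $size (got '$magic')" >&2
        return 1
    fi
}
```

For example, append_dtb linux-3.13.1/arch/arm/boot/zImage arm-dts/rtsm_ve-cortex_a15x1.dtb zImage produces the same file as the cp and cat commands above, with the extra check.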

Boot Loader

There are many ways to create a boot loader, but to meet the goal of a single, all-inclusive executable file, a small assembly boot loader is the best. A great example of this is available in the ARM Fast Models ThirdParty IP package. This is an add-on to ARM Fast Models which contains all of the open source software that can be run on Fast Models examples. The majority of the package is the Linux source trees and file system images for different examples.

I selected two really useful files from the RTSM_Linux source code that comes with the ThirdParty IP package:

  • boot.S - The assembly file that serves as the boot loader. It’s small and easy to understand, and much easier to work with in cases where a full boot loader like u-boot is not needed.
  • model.lds - A linker script that specifies how to link boot.o (compiled boot.S) and zImage into a single executable file.

The addition of these two files, combined with the modified zImage (including embedded file system and concatenated device tree) means that everything is now present in the single executable file.

I also made some minor adjustments to the Makefile, inserting my paths, compiler name, and file names so it would generate the file a15-linux.axf as the final output that will be used in simulation.

The updated Makefile is shown below:

boot make

Running Simulation

Because the interconnect is more complex than a simple address decoder, there are two ways to confirm the design is correct.

  • Run Linux on a fast version of the design
  • Run a small software program in cycle accurate mode

I would recommend both methods to make sure the design is functioning properly and there are no errors in connecting the interrupt, setting CPU model parameters, or other common mistakes made during design construction. My most common mistake is forgetting to set the PERIPHBASE address of the A15 to 0x2c000000.
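PERIPHBASE matters because the interrupt controller registers sit at fixed offsets from it, and the addresses in the device tree have to agree with the hardware. The sketch below computes the two GIC addresses I would expect for a given PERIPHBASE. The offsets 0x1000 (distributor) and 0x2000 (CPU interface) are my understanding of the Cortex-A15 private memory map and are not stated in this article, so check them against the TRM and your .dts file.

```shell
# Print the GIC register block addresses implied by a given PERIPHBASE.
# The offsets are assumptions based on the Cortex-A15 private memory map.
gic_addresses() {
    base=$(($1))
    printf 'distributor 0x%08x\n' $((base + 0x1000))
    printf 'cpu-interface 0x%08x\n' $((base + 0x2000))
}
```

Running gic_addresses 0x2c000000 gives the values to compare with the reg property of the interrupt controller node in the device tree.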

Since I’m focusing on Linux I will describe how to run a fast version of the design.

From the Tools menu in sdcanvas select FastModel System…

 tools fast model

The Fast Model System Creator dialog will appear as shown below. Clicking the Create button will generate a Fast Model equivalent system automatically from the cycle accurate system.

fm system

This Fast Model system can be run in sdsim using Simulation -> Simulate System… (or F5). When the simulator starts, specify the a15-linux.axf file that was created in the last section and watch Linux boot in a matter of seconds. The terminal below shows Linux 3.13.1 with the machine type reported as Versatile Express.

sdsim fm boot

Swap & Play

Now we have established a working cycle accurate simulation and a Fast Model simulation for the same design, both of which can be run with the SoC Designer simulator, sdsim.

Working with Linux on a cycle accurate simulation is a great way to study all of the details of the hardware/software combination such as bus utilization, cache metrics, cache snooping, barrier transactions, and much more. It’s exciting until you realize that Linux takes about 300M instructions to boot and well over a billion clock cycles.

One alternative to waiting for a full cycle accurate simulation is to create Swap & Play checkpoints at various points of interest and then load the checkpoints into the cycle accurate simulation. To create a checkpoint, run the Fast Model simulation and then stop the simulator, using either a breakpoint or the Stop button, at the place you want to stop, such as the Linux prompt. Use the File -> Save As menu in sdsim and select Swap & Play Checkpoint (*.mxc) as shown below:

sandp

The next dialog will ask for a name of the checkpoint and a location for the file. Enter any name and hit OK to save the checkpoint.

To load the checkpoint into the cycle accurate simulation, load the design into sdsim and then use File -> Restore checkpoint view…

Select the checkpoint saved in the Fast Model simulation and it will load into the cycle accurate simulation. There is no need to even load an image file for this case since it will be restored by the checkpoint. The disassembly window and the register window will show the same location that was saved from the fast model simulation. I saved a checkpoint at the Linux prompt and restored it into the cycle accurate simulation and I can even see I’m sitting at the WFI instruction in the Linux idle loop. 

disass

A debugger such as RealView or DS-5 can also be connected to start source level debugging.

An alternative trick I use is the addr2line utility, which maps the address shown in the disassembly or register window back to a source file and line. This is useful to just take a peek at the current location without starting the full debugger. For the disassembly window above I do:

$ arm-linux-gnueabi-addr2line -e vmlinux 0x80014524
/home/cds/jasona/kernel.org/linux-x1/linux-3.13.1/arch/arm/mm/proc-v7.S:73

Now I can see the source code for the current location as a quick check. Sure enough, I’m sitting at the Linux idle loop as expected since nothing is happening sitting in the shell at the prompt. Here is the code:

idle
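To go from the addr2line output straight to the source text, a small helper can split the file:line pair and print the surrounding lines. This is a sketch only: show_source is a made-up name, and it assumes the path contains no spaces and that the source tree is still at the location recorded in vmlinux.

```shell
# Print the source lines around a "file:line" location as reported by
# addr2line. The second argument is the number of context lines (default 3).
show_source() {
    loc=${1%% *}          # drop any trailing "(discriminator N)" text
    ctx=${2:-3}
    file=${loc%:*}        # everything before the last colon
    line=${loc##*:}       # everything after the last colon
    if [ ! -r "$file" ]; then
        echo "no readable source for $loc" >&2
        return 1
    fi
    start=$((line - ctx))
    if [ "$start" -lt 1 ]; then
        start=1
    fi
    end=$((line + ctx))
    sed -n "${start},${end}p" "$file"
}
```

A typical call would be show_source "$(arm-linux-gnueabi-addr2line -e vmlinux 0x80014524)", which prints a few lines on either side of the reported location.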

It’s easy to imagine using swap & play to load checkpoints and do cycle accurate debugging as well as performance analysis for benchmarks. Carbon users typically run benchmarks such as Dhrystone, CoreMark, and LMbench as Linux applications.

So far, this was done using a single core A15 CPU. It’s interesting but there are no actual systems that use a single core A15. Next time I will show how to extend the a15mini for multi-core simulation by moving to the ARM Cortex-A15x2 CPU and running SMP Linux, again with as little hardware as possible. The dual-core A15 matches one of my favorite machines, the Samsung Chromebook.

As I mentioned at the beginning, all of the work that I’ve done here porting Linux and setting up a system which can be used with Swap & Play is available as a CPAK on Carbon IP Exchange.  You can use this system to port your own version of Linux or customize the hardware configuration to match that of your own design.

Jason Andrews

        

 

Tags: virtual prototype, ARM Cortex-A15, CPAK, SoCDesigner Plus, Virtual Platform, Jason Andrews, Linux, ARM

Getting Up and Running with the ARM Mali-450 MP GPU (Part 3 of 3)

Posted by Varun Subramanian on Tue, Dec 10, 2013 @ 09:00 AM

In the previous parts of this blog series, I discussed the procedure of Carbonizing the RTL of an ARM Mali-450 GPU and building an SoC Designer component, followed by the description of a virtual prototype environment that runs cycle-accurate integration tests provided by ARM on the Mali-450 GPU using a Cortex-A15 based system.

This part describes a Linux-based Carbon Performance Analysis Kit (CPAK) containing a Cortex-A15 based system with an ARM Mali-450 GPU. This CPAK enables the user to integrate the Mali GPU drivers into the Linux kernel and register the Mali driver while booting Linux on the virtual prototype environment. We’ll do this by combining the speed of ARM’s Fast Models with the accuracy of the Carbonized GPU model in the same system.

Partitioning the System

The block diagram in Figure 1 shows the Mali-450 CPAK that we’ve been working with so far. This system is composed entirely of 100% accurate models which enable us to see the impact of our IP configuration settings and bare-metal software. As we migrate to higher level tasks such as driver development, we can map portions of the system to more abstract models in order to execute at faster speeds. SoC Designer automatically understands the relationship between Carbonized models of ARM IP and their Fast Model equivalents, so all we need to do is tell the Fast Model Generator tool in SoCDesigner which models we would like to execute accurately, and which ones should be represented as Fast Models.

 ARM Mali-450 CPAK Block Diagram

The dotted line indicates the portion of the virtual prototype we’d like to execute as ARM Fast Models. The rest of the system will continue to execute as cycle accurate models. Once we’ve identified the models for each domain, SoCDesigner will automatically build a new virtual prototype representation containing both Fast Models and Carbonized, 100% accurate models. Figure 2 shows a block diagram that describes such a system. All of the necessary transactor logic to map between the abstract models and the accurate models is inserted automatically.

 

Using the CPAK with ARM Fast Models

Now that we’ve updated the CPAK to ARM Fast Models, let’s put it to use. Since the virtual prototype will run at ARM Fast Model speeds when the accurate models are not being used, we can now use this platform to quickly boot Linux and integrate the Mali device driver.  In order to do this, I integrated the Mali Linux drivers into the Linux kernel that was used as part of the A15 Linux CPAK. The Linux drivers are part of the Mali-450 GPU Linux Driver Development Kit (DDK) provided by ARM. The DDK contains the Mali Linux device drivers and the base drivers. It also provides links to download the OpenGL drivers, specific to Mali-450 GPU. The Integration Guide was used, along with minor modifications specific to the virtual prototyping environment, to integrate the software into the Linux kernel. The integration procedure involved the following steps:

  1. Integrating and building the Mali Linux device driver: The user specifies the macros corresponding to the GPU configuration parameters, along with configuring the Mali GPU memory, the framebuffer memory and power management options. The device driver was built as part of the kernel image, but it can also be built as a kernel module.
  2. Building the OpenGL and GLES libraries and adding them to the root file system.
  3. Building the GPU benchmarks and adding them to the root file system.
  4. Building the kernel image that contains the Mali device drivers, base drivers, the OpenGL drivers and the GPU benchmarks.

 

This CPAK can prove useful to device driver developers, who can step through the process of loading the Mali device drivers during boot-up. Figure 3 shows the messages printed by the console while loading the Mali Linux drivers with debug messages turned on. It describes the whole procedure of registering the drivers, which starts with initializing the Mali memory system and ends with initializing the platform device, provided all intermediate steps pass. The intermediate steps define the settings for the framebuffer and for the dedicated and shared memory. Each of the internal components of the Mali-450, including the L2 cache, the Graphics Processor, the Pixel Processors and the Dynamic Load Balancing unit, is then created and the corresponding base addresses are defined.

Figure 3: Console messages printed while loading the Mali Linux drivers

The Results

With the driver ported to Linux, it is now possible to quickly boot the OS in just over two minutes and then start processing frames with 100% accuracy.  The benchmark that uses the Mali base drivers runs test suites on:

  1. GPU memory allocation
  2. Initialization of the Graphics Processor
  3. Initialization of the Pixel Processors
  4. Running vertex shader jobs
  5. Running rendering jobs that draw a simple triangle.

Getting the whole system up and running took me 3 weeks, but all of the work that I’ve done is incorporated into the Mali-450 CPAK meaning that you can duplicate the same results within minutes of downloading the package. The package includes an app note that describes the steps that the user would need to take to use pre-built binaries to perform the integration process.

 

              

Tags: ARM Cortex-A15, CPAK, Performance Analysis, Swap & Play, Benchmarks, ARM Mali-450, Varun Subramanian

Getting the Most Out of Advanced ARM IP

Posted by Bill Neifert on Wed, Oct 23, 2013 @ 04:36 PM

Fall is always a busy time of year here in New England where Carbon is based. The students are back at the colleges and universities, leaf peepers clog up all the roads and for a brief period all four major pro sports teams are playing (Go Red Sox!). Fall is also Carbon's busy time for conferences. Although we don't typically exhibit at many tradeshows, we do try to have a presence at most of ARM's technical conferences. This flurry of activity starts with ARM® TechCon™ in Santa Clara and typically ends six weeks later with the European Technical Symposium in Paris. Although it makes for a large amount of travel, it's a great opportunity to speak with lots of ARM designers and programmers about the challenges they're facing.

Carbon will be present at most of these conferences (the full list is at the end of the blog) and making presentations at most of them as well. At ARM TechCon, we'll be doing a joint presentation with ARM entitled "Getting the Most Out of the ARM CoreLink™ NIC-400." This presentation will go over some of the highlights of ARM's NIC-400 product and then discuss a two-step methodology to optimize this crucial piece of IP to best attain your design goals. After last year's ARM TechCon joint presentation with ARM, "High Performance or Cycle Accuracy? You Can Have Both!" we made the corresponding whitepaper and presentation available for download and we will do that once again this year after the conference.

At the other conferences, we'll be presenting "Getting the Most Out of Advanced ARM IP" which will discuss methodologies to optimize some of ARM's other recently announced IP blocks such as the ARM Cortex™-A57, Cortex-A53, Cortex-A15, Mali and several others.  While I don't want to give away the entire presentation here, you can be sure it will probably make some mention of implementation accuracy, performance optimization and executing pre-silicon firmware.

If you're attending any of these conferences, please come by the Carbon booth and say hi.  We'd love the chance to talk with you about the latest work we've been doing to enable customers to optimize their ARM-based SoC designs and get the most out of advanced ARM IP.

Carbon will be participating in the following conferences:

Date            Event                       Location
October 30-31   ARM Techcon                 Santa Clara Convention Center, Santa Clara, CA
November 19     ARM Technology Symposium    Intercontinental Seoul Coex, Seoul, Korea
November 25     ARM Technology Symposium    Sheraton Hongqiao, Shanghai, China
November 27     ARM Technology Symposium    Sheraton Dongcheng, Beijing, China
November 29     ARM Technology Symposium    Ritz-Carlton Futian, Shenzhen, China
December 6      ARM Technology Symposium    Tokyo Conference Center, Tokyo, Japan
December 12     ARM Technology Symposium    CAP 15, Paris, France

                

Tags: ARM Cortex-A15, CPAK, ARM Cortex-A53, Bill Neifert, ARM Cortex-A57, ARM NIC-400

Standing Out in the Crowd, Differentiating Your ARM-based SoC

Posted by Bill Neifert on Tue, May 14, 2013 @ 08:30 AM

As system on chip (SoC) designs have grown from tens to hundreds of millions of gates, designers have had to go to great lengths to deliver designs which are well differentiated from the competition. Whereas the majority of the content of previous generation SoCs may have been designed internally or created from scratch for a new generation, this is certainly not the case nowadays as the vast majority of intellectual property (IP) blocks are reused from previous designs or, more likely, purchased from external sources. You only need to look at the fantastic market success enjoyed by IP companies such as ARM and Arteris to see how much the industry now relies upon third party IP to drive their systems. (The recent spate of purchases by Cadence® shows how much importance they see in this market segment.) As the third party content of the chip rises, however, it becomes increasingly difficult to differentiate your SoC design from others in the marketplace. If everyone is designing with an ARM® Cortex™-A15 processor, Arteris® FlexNoC™ interconnect, Cadence memory controller and Imagination Technologies® GPU, and you’re using the same IP, how can you differentiate your system design, especially if you’re all using the same fab?

As the leading provider of 100% accurate virtual IP models and systems (including models from ARM, Arteris, Cadence, Imagination Technologies and many more), Carbon has seen this exact problem play out time and time again throughout our customer base.  Differentiation is achieved in many ways of course, but the most advanced approaches typically focus around a few key design areas (and I’ll confine my discussion to the front end of the design cycle here since it’s where you can make the most impact): configuration, integration, and power instrumentation. 

Configuration

IP configuration seems intuitively obvious, but the complex interactions between all of the various options in any single IP block can lead to a huge difference in the achievable performance.  Tie that block in with a number of other, similarly configurable blocks and the possible options grow exponentially.  Leading edge designers are taking a two-stage approach towards IP configuration: stand-alone and integrated.  Stand-alone optimization is just what it sounds like.  The IP block is placed in a very simple test system where the IP block is modeled along with traffic generators and receivers on all of the various relevant ports.  This is intuitive of course because it’s how many blocks are verified.  Instead of focusing on verification however, the emphasis is on the interplay of various configuration options. The IP options are exhaustively modeled and then subjected to representative system traffic.  This approach can quickly point out which options will best meet your target specifications. It seems intuitively obvious but only a small number of IP teams were employing this approach for non-interconnect IP when we first started working with them.  A recent design was able to double the performance on memory reads by applying these methods to a memory controller configuration that was already shipping in an earlier generation product.   

                                          

In the interconnect world, the story is a bit different as it’s quite common to construct a virtual platform containing traffic generators combined with real stimulus in order to try out various configuration settings. Carbon has partnered with Arteris to make pre-built virtual prototypes (called Carbon Performance Analysis Kits™ or CPAKs™) available for exactly this purpose. These CPAKs are available containing various ARM processors (including the ARM Cortex-A15, Cortex-A7 and Cortex-A9 with more to come) as well as multiple traffic generators all tied together either with generic interconnect components or IP from ARM or Arteris. Bare-metal and OS software is bundled as well to configure and drive the system. It’s an ideal starting point for quick optimization or customization to more closely reflect your own design.

CPAK featuring an ARM Cortex-A15 and Arteris FlexNoC

Screenshot of a CPAK featuring a dual-core ARM Cortex-A9 together with an Arteris FlexNoC and multiple configurable traffic generators

Integration

Achieving the best system performance is not possible on a per component basis and must be looked at within the context of a system. Traffic generators can give a first approximation of real traffic but the only true way to validate the performance characteristics of a system is to pull it together and have it execute real software. Now the true impact of various configuration settings can be seen and problems uncovered. The interaction between high priority arbiters and high bandwidth components seems to be an especially problematic one as the performance is tweaked to make sure that components aren’t starved and power isn’t wasted (overdesign is just as bad a problem as under-design when you’re on the bleeding edge of mobile devices).

The impact of software here cannot be overstated as it is ultimately the system software that determines how well the overall system will function and how much power it will burn.  Ensuring that the software can correctly interact with the hardware to achieve the desired performance is a key milestone and leading edge designers subject our virtual prototypes to billions of bare-metal and OS-driven cycles running benchmarks of various flavors to not only stimulate the systems in various software-driven ways but also validate that the product being built can live up to the marketing claims they’ll be making.  Have you noticed that certain companies seem to always produce the fastest phone chips?  They’re the ones that don’t look at speed optimization as a hardware problem or a software problem but rather an integrated system problem.

Here's a whitepaper published a few years ago by Samsung discussing how they used a cycle accurate virtual prototype to optimize the performance of their software even after the silicon had already been fabricated. Ideally you do this step earlier in the design so you can impact the hardware decisions as well, but the results which Samsung obtained are still impressive.

                                     

Software Driven Power Optimization

Hold it.  Power?  Didn’t we promise to focus on front-end issues?  Although traditionally relegated to the back end of the design cycle, power decisions are being moved forward in the design cycle and many leading edge SoCs today have a concerted effort to measure the consumption of their devices while it’s still straightforward to make design decisions to reduce the power consumed by the system.  This is typically done by executing system software on an accurate virtual prototype and then tracking the various power metrics.  This can be implemented using a straightforward approach such as dumping waveforms while executing system software and then analyzing these results in your favorite EDA power tool.  Software-based power vectors give a much better indication of how the device will actually perform and enable much more meaningful power decisions than are possible with vectors derived from an RTL testbench. 

More sophisticated customers have adopted an instrumentation flow to enable power analysis. Instead of dumping waveforms during execution and then running these through a power analysis tool, they do a preliminary step of creating power numbers which correspond to the various power states of each model and then instrument the model using callbacks to dynamically track these states throughout the system. This enables runtime power analysis to be done without requiring waveforms and third party power tools. It also means that the system itself runs substantially faster, which is always a big benefit. We’ve published a whitepaper which details the implementation steps required for both approaches.

                                              

Evolving design techniques and an increased reliance upon third party IP have eliminated many of the approaches that designers have used to differentiate their products and distance themselves from the competition. When one door closes, however, another opens, and the design areas discussed here are being used today by leading-edge design teams to attack this space and create cost-effective, differentiated products. Adopting these approaches will enable your designs to keep pace and stand out in the crowd.

          

Note, this post was adapted from an article we included in the most recent newsletter from Arteris.

Tags: ARM Cortex-A15, CPAK, ARM Performance Optimization, ARM Cortex-A9, Carbon Performance Analysis Kit, Bill Neifert, ARM Cortex-A7, Power Analysis

System Level Performance Optimization

Posted by Bill Neifert on Tue, Mar 26, 2013 @ 01:52 PM

Much has been written about the ways to accelerate an SoC design schedule.  If you added up all the marketing claims made by the various EDA companies on time to market savings, you’d end up being able to ship your advanced SoC months before you even conceive of the idea.  We’ve been working on a series of blogs lately focused around SoC design issues and questions that were laid out in the Asking the Tough Questions blog a few weeks ago.  So far we’ve talked about choosing IP, configuring it correctly and optimizing your memory subsystem.  Today we’re introducing software into the equation.  Instead of talking about how this can pull in your overall design schedule though (since that gets written about all the time) I’d like to focus on how software integration can be a true product differentiator.  Driving your SoC under development with system level software doesn’t just get things done more quickly. If done properly, you can use your software to drive the validation and optimization of the SoC being designed and create a more differentiated system-level offering.

Assembling the System

The first step in the process is to assemble the system that you’ll be optimizing. This can be a stumbling block since, very often, the people optimizing the system don’t have the intimate design and modeling knowledge to prototype it. This leads to a few possibilities: bite the bullet and learn about all the IP and the steps involved to tie it together and configure it; don’t worry about the lower level details of the system and just abstract them away; or take an existing system model from someone else and customize it to meet your needs. Since we’ll be talking about optimizing a system early in the design process, we’ll confine the discussion to virtual prototypes since physical prototypes and emulators are available too late in the design cycle to enable system level optimizations.

There are pluses to all of these approaches of course.  Learning about all of the IP being used in the system and then assembling a virtual prototype has a definite allure.  After all, it’s good to know about the blocks in your system and everyone likes learning new things.  Of course, there’s a downside to this as well.  By the time you know enough about the complexity of each block in order to tie a system model together (and possibly needing to create virtual models for some IP), the design window is likely to have passed. 

Abstracting away all the lower level implementation details has a great deal of appeal and it’s how a lot of virtual models are created. This approach is great for high level software development since a properly created model will be functionally equivalent to the actual IP but just without the timing details. This enables the model to run fast and also be developed quickly without knowing a lot of the lower level details about the IP block. The downside of course is that these lower level timing details are really needed if we want to optimize the system level performance of the system. You can try to inject some level of timing details to mimic the performance of an actual IP block but this quickly leads to approximations built upon approximations built upon a model which never attempted to model timing accuracy in the first place. If you really need to see the performance impact of code running on a multi-issue, out-of-order, dual pipelined, multicore processor, are you going to trust timing approximations which you insert into your high level model which doesn’t even model the pipeline, let alone all of the speculatively executed code needed to fill those pipelines? In the end, a more accurate model is needed if we’re truly going to optimize system level performance.

If hardware prototypes are available too late, creating the virtual models yourself takes too long, and using high level virtual prototypes leads to incorrect answers, how can you quickly create a prototype which is accurate enough for system level optimization? 

Speed and Accuracy

This, of course, is where Carbon enters the picture.  SoCDesigner Plus enables you to combine 100% accurate models from leading IP vendors such as ARM, Imagination Technologies, Cadence, Arteris and others, which are available from Carbon’s IP Exchange web portal.  These can be combined with SystemC models, user RTL models compiled using Carbon Model Studio and also ARM’s Fast Models.  Support for ARM Fast Models isn’t unique to Carbon; every vendor offers this.  Only Carbon, however, offers the capability to mix these fast, functional models with their 100% accurate equivalents and boot Linux or Android in seconds and then debug software or profile performance with 100% accuracy.  We’ve talked about this before a few times (here and here for example).  So instead of talking about the feature itself, let’s dig a bit deeper and talk about how this feature can be used to differentiate your SoC design.

Figure 1: Profiling the Linux boot process on a Cortex-A15 CPAK

A System Approach

The key to optimizing system performance is to address it as a system problem, not a hardware or software issue.  Teams spend huge amounts of effort optimizing the performance of the hardware subsystem and an equally large amount of time enhancing the speed of the software but it’s far too rare that these two tasks are performed concurrently.  This means that there are missed opportunities for additional levels of optimization. 

Having a virtual prototype which is both fast and 100% accurate enables this system level optimization to take place.  The typical first step is to port one of the industry standard benchmarks to the platform.  This is always a valuable step since the benchmark will be run on the finished silicon to provide marketing positioning.  Why not execute the same benchmark during the development process to identify areas for optimization? 

The initial port of software benchmark code can easily be done on a high level representation of the system.  In our case, this means running with ARM Fast Model representations of the processor(s) and related IP instead of their Carbonized equivalents.  This fast platform is great for functional code porting but doesn’t reveal much about system performance.  Therefore, once the code is ported and running, it’s time to switch to accurate models to really start getting value from the benchmark.  Swap & Play is great for this task since you can run quickly up to the point of interest, just before the benchmark starts for example, in the Fast Model representation and then switch over to the 100% accurate representation of the system to get correct results.  You can then use these results to optimize the interaction of the various components of the system.  SoCDesigner’s performance analysis visualization tools enable you to quickly identify performance bottlenecks and how they correspond to other activity in the hardware or software.
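To get a feel for why swapping models at the point of interest matters, here is a minimal back-of-the-envelope sketch in Python. It is not a Carbon API; the function name and all of the simulation speeds and cycle counts are made-up assumptions chosen only to illustrate the arithmetic.

```python
def wall_clock_seconds(boot_cycles, benchmark_cycles, fast_cps, accurate_cps, swap=True):
    """Estimate simulation wall-clock time in seconds.

    With swap=True the boot phase runs on the fast model and only the
    benchmark runs on the accurate model; otherwise everything runs on
    the accurate model.
    """
    boot_rate = fast_cps if swap else accurate_cps
    return boot_cycles / boot_rate + benchmark_cycles / accurate_cps


# Illustrative numbers only (assumptions, not measurements).
boot = 2_000_000_000   # simulated work needed to reach the benchmark
bench = 50_000_000     # simulated work inside the region of interest
fast = 100_000_000     # assumed fast model speed, units of work per second
accurate = 100_000     # assumed accurate model speed, units of work per second

with_swap = wall_clock_seconds(boot, bench, fast, accurate, swap=True)
without_swap = wall_clock_seconds(boot, bench, fast, accurate, swap=False)
print(f"with swap: {with_swap:.0f} s, accurate only: {without_swap:.0f} s")
```

With these assumed numbers the swapped run spends almost all of its time in the region you actually care about, while the accurate-only run spends most of its time just getting there.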

System software can be similarly optimized.  When executing software on the accurate representation of the system it is straightforward to track the performance characteristics of both the hardware and software using SoCDesigner’s visualization tools.  Now it is quite simple to draw correlations between software routines and their impact on overall system performance.  This is obviously of great importance as system complexity increases.  We’ve blogged previously about how Samsung used this approach with great success to optimize the software performance of a hybrid disk controller after silicon was already delivered.  The whitepaper discussing this is available as well.  This same approach is even more valuable when the hardware is still in a state where it can also be optimized.  After all, system optimizations are typically most effective when they’re based upon both hardware and software features.

The approaches above are just scratching the surface of the optimizations that are possible when running real software on real hardware in advance of silicon.  We didn’t even touch on the verification value which can also be obtained, and there are numerous additional optimizations which can be uncovered as well.  We’ll leave those for a future blog.

Tags: ARM Cortex-A15, CPAK, Performance Analysis, Swap & Play, Benchmarks, Architectural Optimization, ARM Performance Optimization, Carbon Performance Analysis Kit, ARM Performance Analysis, Bill Neifert, ARM Virtual Prototype

Interconnect Optimization

Posted by Eric Sondhi on Tue, Mar 05, 2013 @ 08:30 AM

Bill’s last blog summed up the tough questions SoC architects face at various phases of the design process and Andy’s last blog provided a great description of how Carbon customers tackle the many questions that arise during the IP selection process.  In this blog, I will move on to the interconnect and the questions involved with optimization to allow for a balanced SoC:  For performance critical paths, how can the bandwidth be maximized while minimizing the average latency?   How can less performance sensitive paths be managed without disrupting higher priority traffic?  How can the fabric strike the appropriate balance between throughput and latency for all paths through the bus?  What is the best way to isolate and eliminate performance bottlenecks?  How will cache coherency impact the interconnect traffic – and system throughput? 

Since this topic is a lot deeper than can be explored in a blog posting, you can get a much more in-depth analysis by downloading our whitepaper on interconnect optimization.

Chicken or Egg?

The SoC architect faces a unique challenge when selecting and optimizing the system bus.  Usually the CPU core(s) and possibly the GPU are selected prior to the interconnect, but other IP blocks, including the main memory controller, DMAs and general purpose peripherals, are yet to be selected.  Thus the interconnect must be designed and optimized with an incomplete understanding of the actual workloads that it will need to arbitrate and load balance, as well as uncertainty about the latency and throughput constraints that the actual slave devices will impose.  It is no surprise that the temptation to overdesign a low latency mesh is so strong, as Bill noted. 

The paradox is that the key design decisions in optimizing the fabric require an understanding of the final system.  Consider the performance sensitive path between the CPU and main memory.  The interconnect must harmonize with the main memory controller configuration and programming such that the memory bandwidth is maximized while the latency through the bus is minimized.  This design tradeoff is fundamental and must be accurately profiled, characterized, and displayed for the SoC architect.

The relationship between latency & throughput visualized in SoC Designer
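The coupling between latency and throughput can be sketched with Little's Law: sustained bandwidth is roughly the amount of data in flight divided by the round-trip latency. The Python below is a simplified illustration, not something taken from SoC Designer; the function names and the example numbers (64-byte transactions, 80-cycle latency, a 16 byte-per-cycle bus) are assumptions.

```python
import math


def sustained_bandwidth(outstanding, bytes_per_txn, latency_cycles, peak_bytes_per_cycle):
    """Bytes per cycle a master can sustain, by Little's Law, capped at the bus peak."""
    demand = outstanding * bytes_per_txn / latency_cycles
    return min(demand, peak_bytes_per_cycle)


def outstanding_needed(bytes_per_txn, latency_cycles, peak_bytes_per_cycle):
    """Smallest number of in-flight transactions that can saturate the bus."""
    return math.ceil(peak_bytes_per_cycle * latency_cycles / bytes_per_txn)


# Assumed example: 64-byte bursts, 80-cycle memory latency, 16 bytes/cycle peak.
print(sustained_bandwidth(4, 64, 80, 16))   # bandwidth with 4 transactions in flight
print(outstanding_needed(64, 80, 16))       # in-flight transactions needed for peak
```

The point of the sketch is that halving the latency or doubling the number of outstanding transactions has the same effect on bandwidth, which is why the interconnect and memory controller settings have to be tuned together.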

A comprehensive I/O prioritization scheme must be established such that the fabric can provide flow control and arbitration to deliver appropriate QoS for the application.  Moreover, real world dynamics such as transient shifts in traffic and head-of-line blocking can completely disrupt the designed flow control of the interconnect.  Download the whitepaper for a more in-depth look at the impact of slave device latency variations on traffic prioritization.
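Head-of-line blocking is easy to see in a toy model. The Python sketch below compares a single in-order queue against one queue per slave. It is a deliberately crude illustration under stated assumptions: all requests arrive at cycle 0, each queue serves one transaction at a time, and the master names, slave names and latencies are hypothetical.

```python
def completion_single_fifo(requests, latency):
    """All requests share one in-order queue and are served strictly in order."""
    t = 0
    done = {}
    for name, slave in requests:
        t += latency[slave]          # must wait for everything ahead of it
        done[name] = t
    return done


def completion_per_slave_queues(requests, latency):
    """One queue per slave: a request only waits behind requests to the same slave."""
    busy_until = {}
    done = {}
    for name, slave in requests:
        t = busy_until.get(slave, 0) + latency[slave]
        busy_until[slave] = t
        done[name] = t
    return done


# Hypothetical traffic: two slow DMA reads queued ahead of a CPU read to DRAM.
latency = {"flash": 40, "dram": 10}
requests = [("dma0", "flash"), ("dma1", "flash"), ("cpu0", "dram")]

print(completion_single_fifo(requests, latency))
print(completion_per_slave_queues(requests, latency))
```

In the shared queue the latency-critical CPU read finishes at cycle 90 even though its own slave needs only 10 cycles; with separate queues it finishes at cycle 10. A real fabric is far more complicated, which is exactly why it needs to be profiled on an accurate model.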

Carbon customers have found effective ways of resolving this “chicken or egg” design dilemma of configuring interconnect to meet system performance objectives while providing robust QoS early in the design process, when the exact I/O workloads and target constraints are unknown. 

A Two-Phased Approach:

The first phase involves configuring and building a virtual prototype to quickly and easily isolate performance bottlenecks.


   Phase 1 – Architectural Exploration

  • 100% cycle accurate model of interconnect available from Carbon IP Exchange
  • Carbon AXI traffic generators
  • Flexible memory sub-system models

   Allows fast cursory optimization:

  • Broad traffic profiling and sensitivity analysis
  • Transaction tracing and back-pressure identification
  • “What if?” analysis

Architectural exploration platform with Arteris FlexNoC 

As IP blocks are selected later in the design process, the virtual prototype can be reused and modified to incorporate 100% implementation accurate models in place of traffic generators:


   Phase 2 – Real World Virtual Prototype

  • 100% cycle accurate models of CPU & other IP
  • Real world application & traffic
  • HW & Software Interaction

   Allows architectural validation:

  • Accurate traffic profiling
  • Incremental optimization as more IP blocks are selected or designed
  • Cache coherency analysis

Arteris FlexNoC based Real World Virtual prototype

Design tradeoffs explored earlier in Phase 1 can be revisited and validated against actual multicore CPU traffic.  The platform can be further modified as DMA IP blocks and memory controllers are selected.  

Read more about this iterative two-phased approach to interconnect optimization in our white paper.

Coherency & Accuracy

Hardware based cache coherency in SoCs has introduced significant complexity to the interconnect optimization process.  Artificial workloads from custom traffic generators are not well suited to replicating coherency operations alongside application workloads with high fidelity to the coherent traffic produced by actual IP.  Virtual prototypes, such as the dual ARM® Cortex™-A15 with coherent ARM CCI-400 interconnect, allow for 100% implementation accurate simulation of coherent multiprocessor CPU traffic:

Carbon A15 bare metal CPAK multi-processor reference platform

The system above, from the Cortex-A15 Carbon Performance Analysis Kit (CPAK), allows for complete visibility of ACE traffic workloads in a SoC Designer environment.  The hardware coherency handshaking events in the CCI-400 can be profiled and visualized alongside the bus traffic to get a deeper understanding of the correlation between hardware and ACE traffic.  

CCI-400 Coherency event profiling in relation to A15 ACE interface transactions

Understanding the impact of coherency operations on the interconnect and on overall system performance can help architects partition their designs across multiple networks on chip. 
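To illustrate how much the software access pattern changes coherency traffic, here is a minimal Python sketch that counts snoop requests using a textbook MSI (Modified, Shared, Invalid) protocol. This is an assumption made for illustration only: ACE and the CCI-400 use a richer state model and this sketch ignores details such as snoop filtering, so treat the numbers as qualitative.

```python
from collections import defaultdict


def count_snoops(accesses, num_cores=2):
    """Run (core, op, line) accesses through a minimal MSI model.

    Returns (snoops, writebacks): the number of coherence requests sent
    to the other caches and the number of dirty lines written back.
    """
    state = defaultdict(lambda: ["I"] * num_cores)  # line -> per-core state
    snoops = writebacks = 0
    for core, op, line in accesses:
        s = state[line]
        if op == "R":
            if s[core] == "I":               # read miss: ask the other caches
                snoops += 1
                for other in range(num_cores):
                    if other != core and s[other] == "M":
                        writebacks += 1      # dirty copy is flushed and shared
                        s[other] = "S"
                s[core] = "S"
        else:                                # write
            if s[core] != "M":               # need exclusive ownership first
                snoops += 1
                for other in range(num_cores):
                    if other != core:
                        if s[other] == "M":
                            writebacks += 1
                        s[other] = "I"       # invalidate every other copy
                s[core] = "M"
    return snoops, writebacks


# Each core writes its own line four times: almost no coherency traffic.
private = [(0, "W", 0)] * 4 + [(1, "W", 1)] * 4
# Both cores take turns writing the same line: every write triggers a snoop.
ping_pong = [(0, "W", 0), (1, "W", 0)] * 4

print(count_snoops(private))
print(count_snoops(ping_pong))
```

Both traces perform eight writes, yet the shared-line trace generates four times as many snoops plus a string of writebacks. Traffic generators rarely capture this kind of software dependent behavior.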

Download the white paper for an in-depth look into how cache coherency affects bus traffic and how the SoC Designer environment facilitates the analysis of this complex interaction.  

Tags: virtual prototype, ARM Cortex-A15, SoCDesigner Plus, Performance Analysis, Architectural Optimization, Carbon Performance Analysis Kit, Arteris FlexNoC, Eric Sondhi, ARM CCI-400

Asking the Tough Questions

Posted by Bill Neifert on Wed, Jan 30, 2013 @ 08:30 AM

It always seems that there are more questions than answers when embarking upon any new endeavor.  System on chip (SoC) design is no exception to this rule.  There are many approaches that you can use in an attempt to answer these questions.  How much can you trust the answers that you get and how much of your future do you want to gamble on those answers?  The best way to answer that may be by asking a lot more questions…

Assuming that you’re designing an SoC, a typical design process starts with IP selection.  There is a lot of IP available from different vendors.  How do you choose the IP that best meets your design needs from a price, performance and area perspective?  Do you need an ARM® Cortex™-A15 or can you meet your design goals with a Cortex-A9 instead?  Should you reuse the hardware codec from the previous design or replicate that functionality in software for this revision?  It’s tempting of course to just choose the latest, fastest IP available but this approach has all sorts of potential issues.  You may end up with a very overdesigned SoC.  This design may blow away its performance targets but at the expense of much higher IP costs and potentially much higher power consumption.

Once you’ve chosen your IP blocks, either from external or internal sources, they need to be configured to play together efficiently to meet your design goals.  Once again, it may be tempting to use a fully connected mesh to hook together all of your IP with wide, low-latency busses but you’ll probably spend more power than you need and have unnecessarily fast speeds on some data transfers.  How will cache coherency cycles figure into your bus utilization?  The latest coherency extensions to the AMBA bus protocols are dramatically changing the shape of system bus traffic.  How much will this impact overall system throughput?  If the design is partitioned into multiple fabrics (and most advanced designs do indeed contain multiple fabrics or networks on chip), what will be the design impact of placing an IP block in one domain versus another?

The interconnect is of course only a part of the system performance equation.  How does your memory controller or controllers factor into the system performance equation?  Do you have too much latency on your processor to memory datapath?  Have you overdesigned the system in such a way that the data comes back quickly but you’re burning too much power?  What is the impact of this on your backend layout?  Does your interpretation of the arbitration priority of your memory controller match what will really happen on the bus when all of the components are tied together? 

You shouldn’t underestimate the impact of cache sizing and layout on your system performance either, especially with the new AMBA Coherency extensions.  A few cache sizing and configuration decisions can have a dramatic impact on system performance and resultant snoop traffic.  How will your cache size impact system performance?  How much will that impact change when you start running different software?
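The sensitivity of hit rate to cache size and to the software being run can be shown with a very small model. The Python sketch below simulates a fully associative LRU cache, which is a simplifying assumption (real caches are set associative and have their own replacement policies), and the looping workload is hypothetical.

```python
from collections import OrderedDict


def hit_rate(addresses, num_lines, line_bytes=64):
    """Hit rate of a fully associative LRU cache with num_lines lines."""
    cache = OrderedDict()                    # keys are line tags, oldest first
    hits = 0
    for addr in addresses:
        tag = addr // line_bytes
        if tag in cache:
            hits += 1
            cache.move_to_end(tag)           # mark as most recently used
        else:
            if len(cache) >= num_lines:
                cache.popitem(last=False)    # evict the least recently used line
            cache[tag] = True
    return hits / len(addresses)


# Hypothetical workload: four passes over a 32 KiB buffer, one access per line.
trace = [addr for _ in range(4) for addr in range(0, 32 * 1024, 64)]

for lines in (256, 512, 1024):               # 16 KiB, 32 KiB and 64 KiB caches
    print(lines * 64 // 1024, "KiB:", hit_rate(trace, lines))
```

For this workload a 16 KiB cache never hits, a 32 KiB cache hits on every access after the first pass, and doubling again to 64 KiB buys nothing. A different working set moves that cliff, which is why the answer changes when the software changes.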

Choosing and configuring your IP is only the start of your system design problem.  A system is far more than just hardware and there needs to be a way to get all the layers of the software stack up and running on the system as well, preferably well before silicon is taped out.  This is simple enough for high level application software but what about your firmware, drivers, diagnostics and other software that actually depends upon the hardware functionality of your new system?  How will you validate the impact of software on system level performance and power consumption?

There are typically substantially more engineers writing this software than there were hardware engineers to create and verify the hardware.  How will they all be enabled to develop and debug their portion of the system design?  When they uncover hardware problems (or at least problems that they blame on the hardware!) how will these problems be debugged?  Will the problem be thrown over the wall several times as fingers are pointed or will there be a common debug mechanism to enable the hardware and software engineer to work together?  Will that solution be affordable enough for everyone to use it or will it be a scarce, productivity-limiting resource with a signup sheet filled 24 hours a day? 

Finally, what happens if you want to enable your end customers to be productive on your system design before silicon is produced?  After all, they may need to write software for the system too, and if they have to wait until after silicon to accurately develop this software you’ll need to wait longer to start making money on your design.  If this whole process takes too long, the design may even miss its market window entirely. 

These questions are not unique, nor are they new.  They’ve been growing in importance however.  As more and more chip designs migrate from custom ASICs to SoCs, the system-level design problems which used to be faced by just a few teams are now faced by almost everyone.  There are a variety of solutions in the market to address these problems ranging from RTL simulation to high level virtual prototypes to all manner of hardware prototypes.  Over the course of the next few weeks we’ll look into each of these areas in more detail, discuss the questions that our customers are asking and show how they’re getting answers that they can trust.

Tags: ARM Cortex-A15, Architectural Optimization, ARM Cortex-A9, Arteris FlexNoC, ARM Performance Analysis, ARM Cache Optimization, Bill Neifert, IP Selection
