Cortex-A9 Cache Optimization (Part 3 of 3)

Posted by Toshihisa Oishi on Tue, Jun 26, 2012 @ 08:30 AM

In the previous two installments we created and configured our virtual prototype and analyzed the performance of the L1 cache. In this final installment we'll optimize the L2 cache and look at how it interacts with the DDR controller.

Let's take a look at the graph below. This time, the L2 cache has been installed and enabled. The working set of the program used in this analysis was about 80KB, so 128KB was chosen as the L2 size to make it fit. The graph shows the number of hits and misses for instructions and data separately.

 

 Cache Analysis Hits and Misses Fig 1
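
The sizing choice above can be written out as a small calculation. This is only a sketch: it assumes the L2 size must be a power of two in KB, which is not stated in this post, and the function name is made up for illustration.

```python
def smallest_pow2_cache_kb(working_set_kb):
    """Return the smallest power-of-two size (in KB) that holds the working set.

    Assumption: cache sizes are restricted to powers of two.
    """
    if working_set_kb <= 0:
        raise ValueError("working set must be positive")
    size_kb = 1
    while size_kb < working_set_kb:
        size_kb *= 2
    return size_kb


# The ~80KB working set from this analysis needs a 128KB L2 under this assumption.
print(smallest_pow2_cache_kb(80))
```

With a 64KB L2 the 80KB working set would not fit, so under this assumption 128KB is the first size that does.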

 

The Cortex-A9 clearly does a far better job of staying in cache. As in the earlier installments, an interesting phenomenon has been captured this time as well. Take a look at the I-cache hits for the ARM1176 at the 16KB and 32KB L1 sizes. (Remember that we're looking at L2 cache statistics as the L1 size is varied, so every hit or miss here represents a miss in the L1 cache.) The dataset size of the program is causing the L1 cache to thrash. The execution characteristics of the A9 are much more in line with what would typically be expected.

The next graph shows the number of simulation cycles. The red bar represents the number of cycles in which the DDR3 is accessed. When the L2 cache is installed and enabled, the size of the L1 cache has little effect on the overall number of cycles. In other words, we can choose the minimum L1 cache size for both configurations, since the L2 has such a dramatic impact on the DDR3 cycles.

Virtual Prototype Cycles Required Fig 2

The next graph shows the number of times the ARM1176 accesses the DDR3 memory, with each access classified by transfer size. Because DDR3 memory transfers data in bursts, transferring a piece of data smaller than 32 bytes incurs a penalty. For instance, even a one-byte transfer consumes a full burst (i.e. 8 beats).

DDR3 access types Fig 3
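
The burst penalty described above can be put into numbers with a short sketch. It takes the 8-beat burst and the 32-byte figure from the text; the 4 bytes per beat is an assumption inferred from those two numbers, and the function names are hypothetical. Address alignment is ignored.

```python
import math

BEATS_PER_BURST = 8   # from the text: one burst is 8 beats
BYTES_PER_BEAT = 4    # assumption: inferred from 32 bytes / 8 beats
BURST_BYTES = BEATS_PER_BURST * BYTES_PER_BEAT  # 32 bytes per burst


def bursts_needed(transfer_bytes):
    """Number of whole bursts consumed by a transfer (alignment ignored)."""
    if transfer_bytes <= 0:
        raise ValueError("transfer size must be positive")
    return math.ceil(transfer_bytes / BURST_BYTES)


def burst_efficiency(transfer_bytes):
    """Fraction of the transferred burst data that was actually requested."""
    return transfer_bytes / (bursts_needed(transfer_bytes) * BURST_BYTES)


for size in (1, 2, 4, 32):
    print(f"{size:2d} bytes: {bursts_needed(size)} burst(s), "
          f"{burst_efficiency(size):.1%} of the burst is useful data")
```

Under these assumptions a 4-byte access uses only one eighth of the burst it pays for, which is why a large number of 2- and 4-byte accesses is so costly compared with full 32-byte transfers.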

If L2 is not enabled, the number of DDR3 accesses gets very high. Furthermore, many of the transfers are small pieces of data rather than full 32-byte transfers, e.g. 2 and 4 bytes. This result shows how important the L2 cache is. BP in the graph stands for branch prediction. The effect of BP is visible when L2 is not used. When L2 is used, by contrast, BP becomes something of a penalty. These results can be observed only if you use a 100% cycle-accurate model.

Over these three installments, we've walked through an actual cache analysis. I hope you found some useful tips about cache analysis in this case study.

To learn more about other performance analysis case studies, please check our blogs.

Tags: CPAK, SoCDesigner Plus, ARM Cortex-A9, Carbon Performance Analysis Kit, ARM Performance Analysis, ARM Cache Optimization, ARM1176
