Performance optimization by resolving ARM fabric arbitration issues
Posted by
Andy Ladd on Wed, Feb 01, 2012 @ 07:30 AM
When I was a kid, I used to love building replica models of WWII airplanes. I would tromp on down to the local hobby store and buy the plane that had the coolest picture on the box. When I got home I would spend hours upon hours building the plane and painting the final model to get it to look exactly like the original. Frustration and disappointment were in store if the final model didn't look exactly like what I expected. I was just a kid back then (many say I haven't evolved much since...), but fast forward to the present and jump into the system-level modeling and performance optimization domain, and I wonder: why can engineers live with model representations of designs that aren't totally accurate?
"Why do I need 100% accurate models?" I can't count the times I've heard this question and wanted to reply, "If you only knew what evil lurks behind inaccurate models."
Here's a good case in point. Engineers must get the arbitration of an AXI fabric right in order to ensure the proper Quality of Service (QoS) and performance between the masters (producers) and slaves (consumers) of the fabric. Catastrophic consequences occur if the arbitration scheme is incorrect, as we shall see in this example.
In this small example, there are two producers of AXI traffic going to an AXI fabric. These two producers can represent any two masters (such as a CPU and a GPU device) where one master must have priority over the other to ensure the proper QoS. The consumer in this example is a DDR2 memory which services memory requests from the AXI fabric. The fabric is programmed to provide fixed priority arbitration so that Master 1 always has priority over Master 2. So when Master 1 and Master 2 are both generating traffic into the fabric, Master 1 should always have its transactions (read and write accesses) handled first, correct? At first glance this would seem so. Look at the diagram below. This figure represents the read and write traffic coming out of the fabric and going into the DDR2 controller. This profile was taken from the analysis tools in Carbon’s SoC Designer, which was simulating the fabric in a realistic system context. You'll notice that the traffic from Master 1 (the taller red, yellow, and green vertical bars) all finishes before Master 2's traffic (shorter green vertical bars), just as you would expect.

But hold on! Let's introduce 1 cycle of additional latency in the memory controller. Look what happens now in the figure below. Master 1's read and write transactions are interspersed with Master 2's transactions! How can that be with fixed priority arbitration, where Master 1 should always have priority?! Just by adding 1 cycle of delay, two detrimental consequences occurred: 1) Master 1 no longer finishes its transactions before Master 2, and 2) instead of only doubling the time it takes Master 1 to finish (as one would expect from adding a 1 cycle delay), it takes 10 times longer for Master 1 to finish!

The reason this occurred was that the 1 cycle delay produced back-pressure on the fabric from the memory controller. This back-pressure caused the input queue servicing Master 1's requests to fill up. Once it was full, the arbiter would allow Master 2 to access the fabric and add transactions to its own queue. There was a very small 1 cycle window in which Master 2 could get its transactions into the output queue, and thereby into the DDR2 controller, before Master 1 finished. So the back-pressure from the memory controller, caused by adding the 1 cycle delay, was the root of the problem.
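To make the mechanism concrete, here is a minimal toy simulation of it. This is only a sketch under simplifying assumptions, not the Carbon model or a real AXI fabric: the function name `simulate`, the queue depth of 4, the 16 transactions per master, and the single shared FIFO feeding the memory controller are all invented for illustration. Index 0 stands for Master 1 (high priority) and index 1 for Master 2.

```python
from collections import deque


def simulate(service_cycles, n_txn=(16, 16), depth=4):
    """Toy fixed-priority fabric (hypothetical, for illustration only).

    Each master may have at most `depth` transactions in flight (its input
    queue). Granted transactions enter one shared FIFO toward a memory
    controller that needs `service_cycles` cycles per transaction.
    Returns (completion order as master indices, finish cycle per master).
    """
    remaining = list(n_txn)      # transactions each master still has to issue
    outstanding = [0, 0]         # in-flight transactions per master
    done = [0, 0]
    finish = [None, None]
    fifo = deque()               # shared output queue toward memory
    in_service, busy = None, 0
    order = []
    cycle = 0
    while len(order) < sum(n_txn):
        cycle += 1
        # Memory controller: start the next transaction, then advance it.
        if in_service is None and fifo:
            in_service, busy = fifo.popleft(), service_cycles
        if in_service is not None:
            busy -= 1
            if busy == 0:
                m = in_service
                order.append(m)
                outstanding[m] -= 1      # frees a slot in that master's queue
                done[m] += 1
                if done[m] == n_txn[m]:
                    finish[m] = cycle
                in_service = None
        # Fixed-priority arbiter: Master 1 (index 0) is always checked first,
        # but it can only be granted if its input queue has room.
        for m in (0, 1):
            if remaining[m] > 0 and outstanding[m] < depth:
                remaining[m] -= 1
                outstanding[m] += 1
                fifo.append(m)
                break
    return order, finish


if __name__ == "__main__":
    for service in (1, 2):       # 2 = one extra cycle of memory latency
        order, finish = simulate(service)
        print(service, finish, "".join(str(m + 1) for m in order))
```

With a 1 cycle memory, Master 1's queue never fills, so it wins every arbitration until it is done. With one extra cycle per transaction, the queue fills, Master 2 is granted in the gaps, the two streams interleave in the shared FIFO, and Master 1 finishes far later than the doubling one might naively expect. The exact slowdown depends entirely on the assumed depths and counts, so the figures from this toy should not be read as matching the 10x seen in the real design.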
Originally the user modeled the fabric with a "95% accurate" behavioral model, which never exposed this issue. Even in the second case, with the 1 cycle delay, that model showed Master 1 finishing all of its transactions before Master 2. These inaccurate models failed to fully represent how the back-pressure would cause queuing issues and thereby affect arbitration, QoS, and performance. Only the 100% accurate models (provided by Carbon) were able to expose this problem, and they eventually saved the customer a re-spin.
So, with the obvious risks and pitfalls of using inaccurate models, my question still remains: why would you ever not use 100% cycle-accurate models while doing performance optimization when so much is at stake?
If you are interested in learning more about how Carbon SoC Designer and Carbon models can be used to analyze and understand AXI fabric issues, please download the presentation from LGE.