Ever since its announcement in the fall of 1991, the Alpha architecture [SW92] has been the foundation of the world's fastest systems. In fact, except for one or two brief blips, Alpha systems have consistently ranked highest in single-CPU SPECmark performance. But with this outstanding performance record come marketing hype and sometimes unrealistic expectations. It is not uncommon to find email messages or USENET articles of the form "I heard the Alpha is so fast, but now I find that my dusty deck is just 10% faster on the Alpha than on the other system." So what's the truth? The honest answer is that it depends on what you're doing. Alpha systems are without a doubt fast machines, but it is unreasonable to expect that taking a dusty deck and running it unmodified on an Alpha will result in the best possible performance. This is particularly true for programs that were written with the mind-set of the eighties, when CPU cycles were at a premium and memory bandwidth was abundant. Reality looks quite different nowadays: CPU clock rates above 150 MHz are the rule and even laptops can run at 200 MHz or more. The result is that today the memory system, and not the CPU, is often the first-order bottleneck.
The purpose of this paper is to demonstrate a few simple techniques that help avoid the memory system bottleneck. Except for one case, the focus is on integer-intensive applications. Optimizing floating-point-intensive applications is certainly just as important, but is unfortunately well beyond the scope of this paper.
The techniques presented can result in tremendous performance improvements. While the techniques will be helpful for all modern systems, they normally yield the biggest benefits on Alpha-based machines. There are several reasons for this:
Before diving into the optimization techniques, this paper gives a brief overview of existing and upcoming Alpha implementations. While it is not usually necessary to optimize for a specific CPU, it is helpful to know the characteristics of current CPUs and systems. Section 3 presents a couple of simple performance analysis tools that are available under Linux. When porting legacy code to modern systems, such tools are invaluable since they help avoid wasting time on optimizing rarely executed code. Section 4 then presents the optimization techniques and gives a few examples of the dramatic performance improvements they can achieve. Finally, Section 5 presents some conclusions.