1 Introduction

Ever since its announcement in the fall of 1991, the Alpha [SW92] architecture has been the foundation of the world's fastest systems. In fact, except for one or two brief blips, Alpha systems have been the highest-performing systems as measured by single-CPU SPECmark performance. But with this outstanding performance record come marketing hype and sometimes unrealistic expectations. It is not uncommon to find email messages or USENET articles of the form "I heard the Alpha is so fast, but now I find that my dusty deck is just 10% faster on the Alpha than on the other system."

So what's the truth? The honest answer is that it depends on what you're doing. Alpha systems are without a doubt fast machines, but it is unreasonable to expect that taking a dusty deck and running it unchanged on an Alpha will result in the best possible performance. This is particularly true for programs that were written with the mind-set of the eighties, when CPU cycles were at a premium and memory bandwidth was abundant. Reality looks quite different nowadays: CPU clock rates above 150MHz are the rule and even laptops can run at 200MHz or more. The result is that today the memory system, not the CPU, is often the first-order bottleneck.

The purpose of this paper is to demonstrate a few simple techniques that help avoid the memory system bottleneck. Except for one case, the focus is on integer-intensive applications. The topic of optimizing floating-point intensive applications is certainly just as important, but unfortunately well beyond the scope of this paper.

The techniques presented can result in tremendous performance improvements. While the techniques will be helpful on all modern systems, they normally yield the biggest benefits on Alpha-based machines. There are several reasons for this:

First, the Alpha architecture has been designed with longevity in mind. Specifically, the Alpha architecture should be good for the next 15-25 years, which corresponds roughly to a 1000-fold increase in overall performance. For this reason, some design tradeoffs were made in favor of long-term viability rather than short-term benefits. For example, the Alpha was a 64-bit architecture right from the start, even though at the time of its announcement 32-bit address spaces were considered comfortably large.

Second, the current Alpha implementations are designed to achieve high performance by pushing clock frequency to the limit. This means that the gap between CPU and memory-system performance is largest on Alpha-based systems. For example, suppose a memory access takes 100ns. On a 500MHz Alpha CPU, this corresponds to 50 clock cycles. In contrast, on a 250MHz CPU, this is only 25 cycles. So the relative performance penalty of a memory access is much higher on a high-clock CPU. This may sound like a bad thing, but since the absolute memory-access time is the same in both cases, what this really means is that a fast-clock system running a memory-bound application will be about as fast as a slower-clock system, whereas when running a memory-wise application it will be much faster.
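The cycle arithmetic above can be sketched in a few lines of C. This is only an illustrative toy model, not something taken from the paper: the function names (latency_cycles, run_time_ns, speedup) are hypothetical, and the model assumes one instruction per clock cycle and that every memory access that misses stalls the CPU for the full latency with no overlap.

```c
#include <assert.h>

/* Cost of one memory access, expressed in CPU clock cycles.
 * latency_ns: memory access time in nanoseconds
 * clock_mhz:  CPU clock rate in MHz (one cycle lasts 1000/clock_mhz ns) */
double latency_cycles(double latency_ns, double clock_mhz)
{
    assert(latency_ns >= 0.0 && clock_mhz > 0.0);
    return latency_ns * clock_mhz / 1000.0;
}

/* Toy run-time model (an assumption, not a measurement):
 * one instruction per cycle, and each miss adds the full
 * memory latency on top, with no overlap. */
double run_time_ns(double instructions, double misses,
                   double latency_ns, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    double cycle_ns = 1000.0 / clock_mhz;
    return instructions * cycle_ns + misses * latency_ns;
}

/* How much faster the fast-clock CPU finishes the same work
 * when both CPUs see the same absolute memory latency. */
double speedup(double instructions, double misses, double latency_ns,
               double slow_mhz, double fast_mhz)
{
    return run_time_ns(instructions, misses, latency_ns, slow_mhz)
         / run_time_ns(instructions, misses, latency_ns, fast_mhz);
}
```

Under these assumptions, latency_cycles(100.0, 500.0) gives 50 and latency_cycles(100.0, 250.0) gives 25, matching the figures in the text. For a hypothetical run of one million instructions, speedup(1e6, 1e5, 100.0, 250.0, 500.0) comes out near 1.17 when one access in ten goes to memory, while speedup(1e6, 1e3, 100.0, 250.0, 500.0) comes out near 1.95 when only one in a thousand does. The doubled clock rate pays off almost fully only in the second, memory-wise case.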

Before diving into the optimization techniques, this paper gives a brief overview of existing and upcoming Alpha implementations. While it is not usually necessary to optimize for a specific CPU, it is helpful to know the characteristics of current CPUs and systems. Section 3 presents a couple of simple performance-analysis tools that are available under Linux. When porting legacy code to modern systems, such tools are invaluable because they keep you from wasting time optimizing rarely executed code. Section 4 then presents the optimization techniques and gives a few examples of the dramatic performance improvements they can achieve. Finally, Section 5 presents some conclusions.