Finally we're at the point where we're ready to look into the techniques that can greatly improve the performance of a given piece of code. Most of them are not novel. Some of them have been around for so long that it would be difficult, if not impossible, to give proper credit. Others are ``obvious'' (once you know them...). But the key point here is that it is the characteristics of modern, and in particular Alpha-based, systems which make these techniques so important and worthwhile.
While the focus of this section is on the Alpha architecture, we would also like to give an idea of how the same techniques work on other architectures. The tricky part is that we want to do this without igniting a war over which CPU architecture is the ``best'' or the ``fastest.'' Such terms are, to a good degree, meaningless since they can be applied usefully only relative to a given problem. For this reason, performance results are presented as follows: for the Alpha we present both absolute and relative results. The absolute numbers are useful to give a concrete feel for how fast the code is. The relative results (i.e., speed ups) is what tells us how well a given technique works. Where meaningful, we also list the speed up (but not the absolute performance) achieved on a Pentium Pro based system. Only listing speed up for the Pentium Pro case makes it impossible to tell which system was faster on a given problem while still allowing us to compare the relative benefits. (To avoid any misconception: this arrangement was not chosen because the Alpha performed poorly; suffice it to say that the author has been using Alphas for some time now and is generally pleased with the performance level they achieve.)
Since an architecture per se doesn't perform at all, let us be a bit more specific about the systems used for the measurements:
It is also illustrative to compare gcc to commercial Alpha compilers, such as Digital's GEM C compiler. The GEM C compiler usually generates somewhat better code but every once in a while it creates code much faster than gcc's (this usually happens on floating point intensive code). For this reason, some of the measurements also include the results obtained with GEM C. This compiler was invoked as cc -migrate -O4 -tune ev5 -std1 -non_shared. It's not clear what the version of the compiler version was---it was whatever came with Digital Unix version 3.2.
The Alpha architecture does not provide an integer division instruction. The rationale for this is that (a) such operations are relatively rare and (b) division is fundamentally of an iterative nature so implementing it in hardware is not all that much faster than a good software implementation [Alv91]. Nevertheless, there are important routines that depend on integer division. Hash-tables are a good example: computing a hash-table index typically involves dividing by an integer prime constant.
There are basically two ways to avoid integer division. Either, the integer division is replaced by a floating point division or, if the division is by a constant, it is possible to replace the division by a multiplication with a constant, a shift, and a correction by one (which isn't always necessary). Floating point division may sound like a bad idea, but the Alpha has a very fast floating point unit and since a 32 bit integer easily fits into a double without loss of precision, it works surprisingly well. Replacing a division by a constant with a multiplication by the inverse is certainly faster, though it's also a bit tricky since care must be taken that the result is always accurate (off-by-one errors are particularly common). Fortunately, for compile time constants, gcc takes care of this all by itself.
To illustrate the effect this can have, we measured how long it takes to lookup all symbols in the standard C library (libc.so) using the ELF hash-table lookup algorithm (which involves one integer division by a prime constant). With integer division, roughly 1.2 million lookups per second can be performed. Using a double multiplication by the inverse of the divisor instead brings this number up to 1.95 million lookups per second (62% improvement). Using integer multiplication instead gives the best performance of 2.05 million lookups per second (70% improvement). Since the performance difference between the double and the integer multiply-by-inverse version isn't all that big, it's usually better to use the floating pointer version. This works perfectly well as long as the operands fit in 52 bits and avoids having to worry about off-by-one errors.
Two parts in modern CPUs that are easy to forget about are the data
and instruction translation lookaside buffers (TLBs). The TLBs hold
the most recently accessed page table entries. TLBs are necessary
since it would be far too slow to access the page tables on every
memory access. Since each TLB entry maps an entire page (8KB with the
current Alpha chips), the TLB is usually not the limiting factor to
performance. The catch is that when a program does suffer from
excessive TLB misses, performance will go down the drain fast. In
such a case, the slow down may easily be big enough that it is
worthwhile to switch to a completely different (normally slower)
algorithm that has a better TLB behavior. For example, on the Alpha
system reading one word from each page in an array of 63 pages (504KB)
takes about 15ns per access. But doing the same in an array of 64
pages takes 65ns per access---a slowdown of more than a factor of
four! Since the second-level cache in that system is 4MB large, this
jump in access time is purely due to the data TLB.
As said above, the TLB is not usually a first-order bottleneck but a small experiment did show that a hash-table that is 50% full and exceeds the data TLB size is no faster than a more compact search tree that needs two memory accesses per lookup (the hash-table needs just one memory access but since it exceeds the TLB size, that one access is slow). In general, the TLB may be the primary bottleneck when large data sets are accessed more or less randomly and sparsely.
On modern systems, memory accesses are bad and computation is good. In the ideal case we would like to completely replace memory accesses by computation. This obviously is not always possible; but where it is possible, the payoff can be big.
For example, let us consider the problem of reversing all the bits in each byte in a long. A byte is reversed as follows: bit 0 is swapped with bit 7, bit 1 with bit 6 and so on. Since we want to reverse all the bytes in a long, this algorithm is applied once for each byte in the long.
Why would anyone want to do this? As you may know, both the Alpha and the x86 architecture are little-endian, but when IBM designed the VGA graphics adapter, it chose to use a big-endian bit order for the pixels in the graphics memory. That is, bit 7 in a byte corresponds to the left-most pixel and bit 0 to the rightmost. This is backwards since the coordinate value for the left-most pixel is smaller. So, byte-reversal is indeed a relatively important operation for VGA X servers.
The traditional way of implementing byte-reversal might look something like the code below. To conserve space, we show the code to reverse a 4-byte integer only. The 8-byte long case is a straight-forward extension of this. The code assumes that array byte_reversed has been initialized such that byte_reversed[i] is the reversed value of i.
u_int byterev_naive (u_int L) { /* this is for sizeof(u_int)==4 */ return (((u_int)byte_reversed[(L >> 0) & 0xff] << 0) | ((u_int)byte_reversed[(L >> 8) & 0xff] << 8) | ((u_int)byte_reversed[(L >> 16) & 0xff] << 16) | ((u_int)byte_reversed[(L >> 24) & 0xff] << 24)); }
With this naive algorithm, each byte-reversal involves a table-lookup. Since the Alpha is a 64 bit architecture, this means that each long reversal involves eight memory accesses. How could we avoid these expensive memory accesses? Well, a simple and arguably more intuitive way to implement byte reversal is to use shifting and masking to swap the bits. The code that does this would look something like this:
Note that, except for the initialization of mask, this code is completely independent of the width of a long! So, to make this work on a 32 bit machine, all we need to do is initialize mask with 0x01010101 instead.u_long byterev_long (u_long L) { u_long mask = 0x0101010101010101; return (((L & (mask << 0)) << 7) | ((L >> 7) & (mask << 0)) | ((L & (mask << 1)) << 5) | ((L >> 5) & (mask << 1)) | ((L & (mask << 2)) << 3) | ((L >> 3) & (mask << 2)) | ((L & (mask << 3)) << 1) | ((L >> 1) & (mask << 3))); }
Now let's see how the implementations compare. The table below gives the results for the Alpha, the Pentium Pro (P6), and a 120MHz Pentium Notebook (P5). In addition to the two implementations shown above, the table also includes a row that shows the performance of the naive implementation when coded in x86 assembly code (implementation byterev_x86). This routine comes straight from the XFree86 v3.2 distribution.
Alpha | P6 | P5 | ||
Implementation | abs | rel | rel | rel |
byterev_naive | 55MB/s | 1.00 | 1.00 | 1.00 |
byterev_long | 72MB/s | 1.31 | 1.13 | 1.17 |
byterev_x86 | n/a | n/a | 0.56 | 1.51 |
As the table shows, the byte-reversal routine that avoids memory accesses is over 30% faster on the Alpha. Interestingly, on the P6 this routine is also fastest. The benefit there is less (``only'' 13%), but given that it's more portable and more intuitive, there is no reason not to use it. What's stunning is the complete failure of the assembly version on the P6. That routine is only half as fast as the version that avoids memory accesses. This may be due the assembly routine's extensive use of byte accesses to CPU registers. For the Pentium, as shown in the P5 column, the assembly coded routine is the fastest version. But don't be misled by the relative performance numbers: in terms of absolute performance, the Pentium is just half as fast as the P6.
To summarize, not only is byterev_long() by far the fastest version on the Alpha, but even on a P6 it appears like the right solution.
One reason why gcc sometimes generates significantly worse code than Digital's GEM compiler is that it does not perform inter-procedural alias analysis. What this means is that gcc's alias analysis is sometimes unnecessarily conservative. To illustrate the problem, consider the function below. It is a simple unrolled loop that reverses all the bytes in array src and stores the result in array dst.
void vecrev_naive (long *src, long *dst, long size) { long i; for (i = 0; i < size/sizeof(long); i += 4) { dst[i + 0] = byterev_long(src[i + 0]); dst[i + 1] = byterev_long(src[i + 1]); dst[i + 2] = byterev_long(src[i + 2]); dst[i + 3] = byterev_long(src[i + 3]); } }
The problem with this code is that the C compiler has no way of knowing how the src and dst pointer relate to each other. For all it knows, dst could point to the second element of the array pointed to by src. When this happens, the two pointers refer to overlapping regions of memory and they are said to alias each other. For the compiler, it is important to know whether the two pointers overlap since that determines the degree of freedom it has in reordering instructions. For example, if the regions overlap, then storing a value to dst[i+0] may affect the value of src[i+1]. Thus, to be on the safe side, the compiler must generate the loads and stores in the above function strictly in the order in which they occur in the source code.
Now, if it is known that the two arrays passed to the function never alias each other, then we can lend gcc a hand by explicitly telling it so. We can do this by first reading all the values from the memory, then doing all the computation and finally storing the results back to memory. Thus, the above code would be transformed into the code shown below.
void vecrev_sep (long *src, long *dst, long size) { unsigned long L0, L1, L2, L3; long i; for (i = 0; i < size/sizeof(long); i += 4) { /* loads: */ L0 = src[i + 0]; L1 = src[i + 1]; L2 = src[i + 2]; L3 = src[i + 3]; /* computation: */ L0 = byterev_long(L0); L1 = byterev_long(L1); L2 = byterev_long(L2); L3 = byterev_long(L3); /* stores: */ dst[i + 0] = L0; dst[i + 1] = L1; dst[i + 2] = L2; dst[i + 3] = L3; } }
Since the stores all happen at the end, gcc knows right away that none of the stores may affect any of the preceding loads. This provides it complete freedom in generating the best possible code for the loads and computation (the assumption here is that byterev_long is an inlined function).
On the Alpha and most other architectures with lots of registers (e.g., most RISCs), this kind of code never (or at least very rarely) hurts performance and usually improves performance for gcc. Unfortunately, the same isn't true for the x86 architecture. The problem there is that only few registers are available. So the code that's better for RISCs is usually worse for the x86 due to additional stores and loads that are necessary to access the temporaries that ended up on the stack. To witness, performance numbers for the two versions are given below:
Alpha | P6 | ||
Implementation |
abs | rel | rel |
vecrev_naive | 82MB/s | 1.00 | 1.00 |
vecrev_sep | 93MB/s | 1.13 | 0.73 |
By now you may be wondering where the promised order of magnitude improvements are. Let's consider a simple problem: matrix multiplication. While simple, it is typical of a problem that is considered floating-point intensive. In reality, most floating-point intensive problems are also very memory intensive. E.g., they process large vectors or matrices. Thus, the memory access pattern often plays a crucial role in achieving the best possible performance. The textbook implementation of matrix multiplication looks roughly like this:
Here, the matrix pointed to by a gets multiplied by the matrix pointed to by b and the result is stored in c. The matrix dimension is passed in argument dim. On the Alpha, a 512 × 512 matrix multiply with this routine executes at about thirteen million floating-point operations per second (MFlops). This is not too shabby, but let's see whether we can squeeze more out of the machine. Having learned our lesson in performance optimization, we might try to unroll the inner loop and avoid all multiplications due to indexing. This does indeed result in a faster version: now, matrix multiply executes at about 15MFlops.void matmul0 (int dim, float *a, float *b, float *c) { int i, j, k; float dot; for (i = 0; i < dim; ++i) for (j = 0; j < dim; ++j) { dot = 0.0; for (k = 0; k < dim; ++k) dot += a[i*dim + k] * b[k*dim + j]; c[i*dim + j] = dot; } }
Rather than declaring the problem solved, let's think about the memory access pattern for a minute. Each element in the result matrix c is a dot product of a row in a and a column in b. For example, the element c[0][0] is computed as the dot product of the first row in a and the first column in b. This is illustrated in Figure 2. In our naive matrix multiply routine, this means that the accesses to a causes a nice, dense, linear memory access pattern. Unfortunately, things do not look quite as good for b. There the memory access pattern is sparse: first, the element at offset 0 is read, then the one at offset dim and so on. Such sparse access patterns are bad for many reasons. Suffice it to say that it's easiest to optimize a machine for dense, linear accesses, so it is likely that those accesses will always be the fastest ones. Fortunately, there is a simple trick that avoids the bad access pattern for matrix b: before doing the actual matrix multiply, we can simply transpose matrix b. Then, all memory accesses are dense. Of course, transposing b causes extra work, but since that matrix is accessed dim times, this may well be worth the trouble.
So, let's change matmul0 into matmul1 by adding a matrix transposition in front of the main loop. The code below assumes that tb is an appropriately sized temporary variable to hold the transposition of b.
If you thought 15MFlops is fast, think again: matmul1 executes at a blazing 45Mflops! Are we done yet? How about we try some loop unrolling, etc.? For matmul0, this bought us 25%, which isn't bad at all. If we unroll the loop eight times and do some other straight forward optimizations, the code may look like the one shown below. For compactness, it assumes that dim is an integer multiple of eight.void matmul1 (int dim, float *a, float *b, float *c) { int i, j, k; float dot; /* transpose b: */ for (i = 0; i < dim; ++i) for (j = 0; j < dim; ++j) tb[i*dim + j] = b[j*dim + i]; ...rest like matmul0, except that b is replaced by tb... }
Was it worth the trouble? By all means yes: matmul2 clocks at a full 80MFlops! Whether you like this kind of code or not may be a matter of taste, but it certainly is fast!void matmul2 (int dim, float *a, float *b, float *c) { float dot0, dot1, dot2, dot3, a0, a1, a2, a3, b0, b1, b2, b3; float dot4, dot5, dot6, dot7, a4, a5, a6, a7, b4, b5, b6, b7; long I, J, i, j, k; ...transpose b into tb like in matmul1...; for (I = 0; I < dim*dim; I += dim) for (j = J = 0; j < dim; ++j, J += dim) { dot0 = dot1 = dot2 = dot3 = dot4 = dot5 = dot6 = dot7 = 0.0; for (k = 0; k < dim; k += 8) { a0 = a[I + k + 0]; b0 = tb[J + k + 0]; a1 = a[I + k + 1]; b1 = tb[J + k + 1]; a2 = a[I + k + 2]; b2 = tb[J + k + 2]; a3 = a[I + k + 3]; b3 = tb[J + k + 3]; a4 = a[I + k + 4]; b4 = tb[J + k + 4]; a5 = a[I + k + 5]; b5 = tb[J + k + 5]; a6 = a[I + k + 6]; b6 = tb[J + k + 6]; a7 = a[I + k + 7]; b7 = tb[J + k + 7]; dot0 += a0 * b0; dot1 += a1 * b1; dot2 += a2 * b2; dot3 += a3 * b3; dot4 += a4 * b4; dot5 += a5 * b5; dot6 += a6 * b6; dot7 += a7 * b7; } c[I + j] = dot0 + dot1 + dot2 + dot3 + dot4 + dot5 + dot6 + dot7; } }
The table below presents a summary of the performance results. For comparison, it also includes the results obtained when compiling the same code with Digital's GEM C compiler and the relative results for the P6.
Alpha | P6 | ||||
gcc | GEM C | ||||
Implementation | abs | rel | abs | rel | rel |
matmul0 | 11MFlops | 1.00 | 13MFlops | 1.00 | 1.00 |
matmul1 | 47MFlops | 4.27 | 66MFlops | 5.08 | 2.53 |
matmul2 | 80MFlops | 7.27 | 105MFlops | 8.08 | 3.97 |
Multimedia applications are the rage these days. All mainstream CPU architectures (with the notable exception of the PowerPC) have so-called multimedia extensions. The Alpha is ideally suited for such applications since it has been a 64 bit architecture right from the start. In fact, the Alpha multimedia extension is completely trivial: it adds only four new instruction types (vector minimum/maximum, pixel error, pack and unpack). Since some of these instructions can operate on different data types (byte or word; signed or unsigned), the total number of instructions added is 13, which is much smaller than the corresponding number for other architectures. The reason the Alpha got away with so few additions is because its original instruction set already contains many of the instructions needed for multimedia applications. For example, there is an instruction that allows eight bytes to be compared in parallel---a seemingly simple instruction that can prove surprisingly powerful in a number of applications.
We illustrate this using mpeg_play, the Berkeley MPEG
decoder.
This loop executes at about 94ns per byte-average (iteration) when compiled with gcc. Unrolling this loop twice and reading ahead the input values needed in the next iteration yields code that is probably close to optimal with this byte-oriented approach. Indeed, with gcc, performance increases to about 60ns per byte-average (let's call this unrolled version of the function byte_avg1).void byte_avg0 (u_long size, u_char *a, u_char *b, u_char *c) { int i; for (i = 0; i < size; ++i) c[i] = (a[i] + b[i]) / 2; }
To get even higher performance, we need to be a bit more aggressive. Considering that the Alpha is a 64 bit architecture, it sure would be nice if we could calculate the average of eight bytes in parallel. Reformulating byte-oriented algorithms in such a data parallel format is often trivial (see Section 4.3). For byte-averaging, it's not quite as simple: the straight forward implementation requires nine bits of precision, since 255+255=510. But if we pack 8 bytes into a 64 bit word, there is no extra ninth bit. How can we get around this? Obviously, we can divide the operands by two before adding them. That way, the sum is at most 127+127=254 which conveniently fits into 8 bits. But the catch is that the result may be wrong: if both operands are odd, it will be one too small. Fortunately, it's easy to correct for this: if bit 0 in both operands is set, a correction by one is necessary. In other words, we can make space for that extra ninth bit by using an additional long register that is used to hold the correction bits. Since all intermediary results now fit into 8 bits, the obstacles to a data-parallel implementation of byte-averaging have been removed. The resulting code is shown below. For simplicity, it assumes that the input vectors are long aligned and have a size that is an integer multiple of the size of a long:
Note that macro VEC() takes an eight bit value and replicates it once for each byte in a long---it's much more convenient and less error-prone to write VEC(0x01) instead of 0x0101010101010101. Maybe it's helpful to explain the core of the averaging a bit: variable CC holds the correction bits, so it's simply the bitwise AND of vectors A0 and B0, masked with a vector of 0x01. We divide A0 and B0 by two by shifting them to the right by one position and masking the resulting long with a vector of 0x7f. This masking is necessary since otherwise bit 0 of the byte ``above'' a byte would sneak in and become bit 7 of that byte, which would cause gross errors. The average is computed by simply adding the vectors A0, B0, and CC. This addition does not cause any overflows since, per byte, the largest possible value is 127+127+1=255.#define CAT(v,x) ((u_long)(v)<<8 | (x)) #define VEC(x) CAT(CAT(CAT(CAT(CAT(CAT(CAT(x,x),x),x),x),x),x),x) void byte_avg2 (u_long size, u_char *ca, u_char *cb, u_char *cc) { u_long *a = (u_long*) ca; u_long *b = (u_long*) cb; u_long *c = (u_long*) cc; u_long A0 = a[0], B0 = b[0], A1, B1, CC, i, count = size / sizeof(u_long); for (i = 1; i < count; ) { A1 = a[i]; B1 = b[i]; ++i; /* read ahead */ CC = (A0 & B0) & VEC(0x01); A0 = (A0 >> 1) & VEC(0x7f); B0 = (B0 >> 1) & VEC(0x7f); c[i - 2] = A0 + B0 + CC; A0 = a[i]; B0 = b[i]; ++i; /* read ahead */ CC = (A1 & B1) & VEC(0x01); A1 = (A1 >> 1) & VEC(0x7f); B1 = (B1 >> 1) & VEC(0x7f); c[i - 2] = A1 + B1 + CC; } }
Despite its look, this code is actually very portable. For a 32-bit architecture, all that needs to change is macro VEC (and even that is necessary only to get rid of compiler warnings). Byte-order is not an issue since even though the data is accessed a long at a time, each byte is still processed individually. This is all nice and fun, but lets not forget the real reason why we tried this: performance. This data parallel version of the byte-averaging loop runs at 5.3ns per byte-average---more than an order of magnitude faster than the unrolled loop!
A summary of the three averaging routines is given in the table below. The relative performance is in terms of throughput (number of byte-averages per second) since that's both more intuitive and more impressive. Results for the Alpha are presented both for gcc and Digital's GEM C compiler; as usual, for the P6, gcc was used.
Alpha | P6 | ||||
gcc | GEM C | ||||
Implementation | abs | rel | abs | rel | rel |
byte_avg0 | 94ns/avg | 1.00 | 69ns/avg | 1.00 | 1.00 |
byte_avg1 | 60ns/avg | 1.57 | 56ns/avg | 1.23 | 0.82 |
byte_avg2 | 5ns/avg | 18.80 | 4ns/avg | 17.25 | 1.93 |
But how does all this affect performance of mpeg_play? This
is best illustrated by comparing the original Berkeley version with
the one optimized using the techniques described in this section
(particularly data-parallel processing and avoiding integer
divisions).
Version | Dither | Framerate |
Original | none | 45.3 frames/sec (1.00) |
Optimized | none | 139.8 frames/sec (3.09) |
Original | ordered | 26.9 frames/sec (1.00) |
Optimized | ordered | 58.9 frames/sec (2.19) |
Optimized | ordered4 | 98.2 frames/sec (3.65) |