Linux Performance Analysis Tools

3 Linux Performance Analysis Tools

The state of Linux performance analysis tools is in rather dire straits (this is true for freeware in general, not just Linux). Commercial products currently have an edge in this area. For example, Digital Unix comes with an excellent tool (or rather tool-generator) called ATOM [SE94]. ATOM is basically a tool that can rewrite any executable. While rewriting, it can add arbitrary instrumentation code to each function or basic block. Digital Unix comes with a bunch of tools built with ATOM: Third Degree (a memory-leak and bounds checker like the well-known Purify) and a number of tools that give very detailed information on the performance behavior of a program (such as cache miss frequency, issue rates, and so on). At present, the freeware community can only dream of such versatile tools.

While bleak, the situation is by no means hopeless. The few tools that are available make for powerful companions when properly used. Even good old GNU gprof has a few features that you may not be aware of---more on this later. Let's start with the most basic performance tool: measuring time.

3.1 Accurately Measuring Time

The Unix way of measuring time is by calling gettimeofday(). This returns the current real time at a resolution of typically one timer tick (about 1ms on the Alpha). The advantage of this function is that it's completely portable across all Linux platforms. The disadvantages are its relatively poor resolution (1ms corresponds to 500,000 CPU cycles on a 500MHz CPU!) and, more severely, the fact that it involves a system call. A system call is relatively slow and tends to disturb the memory system. For example, the cache gets loaded with kernel code, so when your program resumes execution, it sees many cache misses that it wouldn't see without the call to gettimeofday(). This is all right for measuring times on the order of seconds or minutes, but for finer-grained measurements, something better is needed.
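As a minimal sketch of the coarse-grained approach, the helpers below wrap gettimeofday() to time a single call in microseconds. The names now_usec(), time_call_usec(), and busy_loop() are made up for this illustration and do not come from any library.

```c
#include <stddef.h>
#include <sys/time.h>

/* Current wall-clock time in microseconds, as reported by gettimeofday(). */
static long long now_usec(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (long long)tv.tv_sec * 1000000LL + tv.tv_usec;
}

/* Illustrative workload: the volatile counter keeps the loop from being
 * optimized away. */
static void busy_loop(void) {
    volatile unsigned long i;
    for (i = 0; i < 1000000UL; ++i)
        ;
}

/* Elapsed real time, in microseconds, of one call to fn(). */
static long long time_call_usec(void (*fn)(void)) {
    long long start = now_usec();
    fn();
    return now_usec() - start;
}
```

A call such as time_call_usec(busy_loop) is only meaningful when the measured interval is much longer than one timer tick, which is the limitation discussed above.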

Fortunately, most modern CPUs provide a register that is incremented either at the clock frequency of the CPU or an integer fraction thereof. The Alpha architecture provides the rpcc (read processor cycle count) instruction. It gives access to a 64 bit register that contains a 32 bit counter in the lower half of the register. This counter is incremented once every N clock cycles. All current chips use N=1, so the register gets incremented at the full clock frequency (but there may be future Alpha processors where N>1). The top half of the value returned by rpcc is operating system dependent. Linux and Digital Unix return a correction value that makes it easy to implement a cycle counter that runs only when the calling process is executing (that is, it allows measuring the process's virtual cycle count). With gcc, it's very easy to write inlined functions that provide access to the cycle counters:

#include <sys/types.h>          /* u_int, u_long */

static inline u_int realcc (void) {
  u_long cc;
  /* read the 64 bit process cycle counter into variable cc: */
  asm volatile("rpcc %0" : "=r"(cc) : : "memory");
  return (u_int) cc;            /* return the lower 32 bits */
}

static inline u_int virtcc (void) {
  u_long cc;
  /* read the 64 bit process cycle counter into variable cc: */
  asm volatile("rpcc %0" : "=r"(cc) : : "memory");
  /* add process offset (upper half) and count (lower half): */
  return (u_int) ((cc + (cc<<32)) >> 32);
}

With this code in place, function realcc() returns the 32 bit real-time cycle count whereas function virtcc() returns the 32 bit virtual cycle count (which is like the real-time count except that it doesn't count when the process isn't running).
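The arithmetic in virtcc() can be checked without an Alpha. The sketch below is a portable model that takes the 64 bit value as an argument instead of reading it with rpcc; the function names and the sample value in the usage note are invented for illustration, and the layout (count in the lower half, operating system offset in the upper half) is assumed to be as described above.

```c
#include <stdint.h>

/* Model of realcc(): keep only the low 32 bits, i.e. the raw cycle count. */
static uint32_t real_count(uint64_t raw) {
    return (uint32_t)raw;
}

/* Model of virtcc(): (raw << 32) moves the count into the upper half, so
 * the upper half of the sum holds offset + count (modulo 2^32). The lower
 * half of (raw << 32) is zero, so no carry reaches the upper half. The
 * final shift brings the sum back down to the low 32 bits. */
static uint32_t virtual_count(uint64_t raw) {
    return (uint32_t)((raw + (raw << 32)) >> 32);
}
```

For example, with a made-up raw value whose upper half is 0xFFFFFF00 (that is, -256 modulo 2^32) and whose lower half is 0x150 (336), real_count() yields 336 and virtual_count() yields 80: the 256 cycles during which the process was not running have been subtracted.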

Calling these functions involves very small overheads: the slowdown is on the order of 1-2 cycles per call and adds only one or two instructions (which is less than the overhead for a function call). A good way of using these functions is to create an execution time histogram. For example, the function below measures individual execution times of calls to sqrt(2.0) and prints the results to standard output (as usual, care must be taken to ensure that the compiler doesn't optimize away the actual computation). Printing the individual execution times makes it easy to create a histogram with a little post-processing.

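void measure_sqrt (void) {
  u_int start, stop, cycles[10];
  /* volatile operands keep the compiler from folding or dropping sqrt(): */
  volatile double x = 2.0, result;
  int i;

  for (i = 0; i < 10; ++i) {
    start = realcc(); result = sqrt(x); stop = realcc();
    cycles[i] = stop - start;
  }
  /* print only after all measurements have been taken: */
  for (i = 0; i < 10; ++i)
    printf(" %u", cycles[i]);
  printf("\n");
}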

Note that the results are printed in a separate loop---this is important since printf is a rather big and complicated function that may even result in a system call or two. If printf were part of the main loop, the results would be much less reliable. A sample run of the above code might produce output like this:

 120 101 101 101 101 101 101 101 101 101

Since this output was obtained on a 333MHz Alpha, where one cycle takes about 3ns, 120 cycles corresponds to about 360ns and 101 cycles to about 303ns. The output shows nicely how the first call is quite a bit slower since the memory system (instruction cache in particular) is cold at that point. Since the square-root function is small enough to easily fit in the first-level instruction cache, all but the first call take exactly the same time.
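The post-processing mentioned earlier can be as small as the following sketch, which converts cycle counts to nanoseconds (cycles x 1000 / MHz) and prints one histogram row per distinct value. The helper names are hypothetical, and the samples are assumed to be already collected in an array.

```c
#include <stdio.h>
#include <stdlib.h>

/* Convert a cycle count to nanoseconds for a CPU clocked at mhz MHz. */
static double cycles_to_ns(unsigned int cycles, double mhz) {
    return cycles * 1000.0 / mhz;
}

/* qsort() comparison function for unsigned int values. */
static int cmp_uint(const void *a, const void *b) {
    unsigned int x = *(const unsigned int *)a;
    unsigned int y = *(const unsigned int *)b;
    return (x > y) - (x < y);
}

/* Sort the samples in place, print "cycles (ns): occurrences" for each
 * distinct value, and return the number of distinct values. */
static int print_histogram(unsigned int *samples, int n, double mhz) {
    int i = 0, bins = 0;
    qsort(samples, (size_t)n, sizeof samples[0], cmp_uint);
    while (i < n) {
        int j = i;
        while (j < n && samples[j] == samples[i])
            ++j;                        /* skip over equal samples */
        printf("%u cycles (%.0f ns): %d\n",
               samples[i], cycles_to_ns(samples[i], mhz), j - i);
        ++bins;
        i = j;
    }
    return bins;
}
```

Fed the ten sample values above with mhz set to 333.0, this prints two rows: one for the nine calls at 101 cycles and one for the single cold call at 120 cycles.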

You may wonder why the above code uses realcc() instead of virtcc(). The reason for this is simple: we want to know whether the results were affected by a context switch. By using realcc(), a call that suffers a context switch will be much slower than any of the other calls. This makes it easy to identify and discard such unwanted outliers.
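One simple way to discard such outliers is sketched below: treat the smallest sample as the baseline and keep only samples within a chosen factor of it. The function name and the factor are assumptions made for this illustration; a real analysis might prefer a median-based or other statistical criterion.

```c
#include <stddef.h>

/* Copy into kept[] every sample that is at most `factor` times the
 * smallest sample; return the number of samples kept. kept[] must have
 * room for n entries. */
static size_t discard_outliers(const unsigned int *samples, size_t n,
                               unsigned int factor, unsigned int *kept) {
    size_t i, k = 0;
    unsigned int min;

    if (n == 0)
        return 0;
    min = samples[0];
    for (i = 1; i < n; ++i)
        if (samples[i] < min)
            min = samples[i];
    for (i = 0; i < n; ++i)
        /* widen before multiplying so the threshold cannot overflow */
        if ((unsigned long long)samples[i] <= (unsigned long long)min * factor)
            kept[k++] = samples[i];
    return k;
}
```

With a factor of 2, a run such as 101 101 250000 101 120 keeps the four ordinary samples and drops the one that was inflated by a context switch.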

The cycle counter provides a very low-overhead method of measuring time with a resolution of a single clock cycle. On the downside, it cannot measure very long intervals. On an Alpha chip running at 500MHz, a 32 bit cycle counter overflows after just eight and a half seconds! This is not normally a problem when making fine-grained measurements, but it is important to keep the limit in mind.
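The overflow limit follows directly from dividing 2^32 by the clock frequency. The sketch below computes it and also shows why a plain unsigned subtraction of two readings stays correct across a single wraparound; both function names are hypothetical.

```c
#include <stdint.h>

/* Seconds until a 32 bit counter incremented at hz Hz wraps around. */
static double wrap_seconds(double hz) {
    return 4294967296.0 / hz;           /* 2^32 increments */
}

/* Cycles elapsed between two readings. The result is reduced modulo 2^32,
 * so it is correct even if the counter wrapped once between the readings,
 * as long as the true interval is shorter than wrap_seconds(). */
static uint32_t elapsed_cycles(uint32_t start, uint32_t stop) {
    return stop - start;
}
```

For a 500MHz clock, wrap_seconds(500e6) evaluates to roughly 8.59 seconds, which matches the figure quoted above.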

3.2 Performance Counters

The Alpha chips, like most other modern CPUs, provide a variety of performance counters. These allow measuring various event counts or rates such as the number of cache misses, instruction issue-rate, branch-mispredicts, or instruction frequency. Unfortunately, the author is not aware of any Linux API that would provide access to these counters. This is particularly unfortunate since both the Pentium and the Pentium Pro chips provide similar counters. Digital Unix gives access to these counters via the uprofile and kprofile programs and an ioctl-based interface documented in the pfm(7) man page. Hopefully, something similar (but more general) will eventually become available for Linux. With the proper tools, these counters can provide a wealth of information.

3.3 GNU gprof

Most readers are probably familiar with the original gprof [GKM82]. It's a handy tool to determine the primary performance bottlenecks at the function level. However, with the help of gcc, GNU gprof can also look inside a function. We illustrate this with a truly trivial function that computes the factorial. Assume we've typed up the factorial function and a simple test program in file fact.c. We can then compile that program like this (assuming GNU libc version 2.0 or later is installed):

gcc -g -O -a fact.c -lc

Invoking the resulting a.out binary once produces a gmon.out file that contains the execution counts for each basic block in the program. We can look at these counts by invoking gprof with options "-l --annotate". This generates a source code listing that shows how many times a basic block in each line of source code has been executed.


*** File fact.c:
          20 -> static unsigned long fact (unsigned long n) {
                    if (n > 1) {
          19 ->         return n * fact(n - 1);
                    } else {
           1 ->         return 1;
                    }
          20 -> }

                int main (int argc, char **argv) {
           1 ->     printf("fact(20)=%lu\n", fact(20));
                    return 0;
                }


Top 10 Lines:

     Line      Count

        1         20
        7         20
        3         19
        5          1
       10          1

Execution Summary:

        5   Executable lines in this file
        5   Lines executed
   100.00   Percent of the file executed

       61   Total number of line executions
    12.20   Average executions per line

Figure 1: Basic-block execution counts

Our factorial example results in the listing shown in Figure 1. Here is how to interpret it: the basic block starting at the printf line in function main() was executed once, so it has been annotated with a 1. For the factorial function, the function prologue and epilogue were executed 20 times each, so the first and last lines of function fact are annotated with 20. Of these 20 calls, 19 resulted in a recursive call to fact and the remaining one simply returned 1. Correspondingly, the then branch of the if-statement has been annotated with 19 whereas the else branch has an annotation of 1. It's that simple.
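To make the counts concrete, the following sketch instruments the factorial function by hand with one counter per basic block. It is only a model of what basic-block profiling records, not a description of how gcc implements it; the enum and the bb_count array are invented for this illustration.

```c
/* One counter per basic block of fact(); the names are illustrative. */
enum { BB_ENTRY, BB_THEN, BB_ELSE, BB_EXIT, BB_COUNT };
static unsigned long bb_count[BB_COUNT];

static unsigned long long fact(unsigned long long n) {
    unsigned long long result;

    ++bb_count[BB_ENTRY];               /* function prologue */
    if (n > 1) {
        ++bb_count[BB_THEN];            /* recursive case */
        result = n * fact(n - 1);
    } else {
        ++bb_count[BB_ELSE];            /* base case */
        result = 1;
    }
    ++bb_count[BB_EXIT];                /* function epilogue */
    return result;
}
```

After a single call to fact(20), the entry and exit counters both hold 20, the then-branch counter holds 19, and the else-branch counter holds 1, which are the same numbers gprof reports in the annotated listing.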

There certainly are no surprises in the behavior of function fact(), but in realistic, more complicated functions or in code that was written by somebody else, this knowledge can be very helpful to avoid wasting time optimizing rarely executed code.
