Pentium Optimizations
Pentium Optimization article by Agner Fog.


Feature Article


Agner Fog wrote up the results of his findings on Pentium performance in the form of an Optimization tutorial, originally posted to the newsgroup comp.lang.asm.x86. It is highly detailed, and to the best of my knowledge, very accurate. I have only seen this depth of information on the subject in expensive books. In my own personal quest to know more about Pentium optimizations, I've realized that the biggest holes in my knowledge were regarding cache activity, concurrent FPU operations and branch prediction. These subjects as well as other advanced topics have been covered here quite well.

As discussed on comp.lang.asm.x86, the information here tends to be more accurate than even Intel's own documentation, because it was all derived through careful timing measurements by unbiased, independent parties (namely Agner Fog, Karki Jitendra Bahadur, and others). Intel's documentation, by contrast, was produced through miscommunication between its engineers and writers (mostly college interns) in an environment where public dissemination of information has not been taken seriously.

This latest version has significant updates to the branch prediction mechanism. An HTML-ification of the file is also available. In the future, this HTML version will make better use of HTML markup to be more convenient to browse.

It is presented here with permission. Updated 08/19/97. [PH]

Update: Agner Fog has written up a more detailed account of Intel's branch prediction strategy. You can see Agner's latest article at Robert Collins's Intel Secrets site.


Examples
Example 1

What follows is a comparison between two inner loops used to calculate an iteration of the Mandelbrot set using the FPU. It is very similar to the algorithm used by the program FracInt. Although FracInt has a reputation for being a highly optimized program, a careful examination, applying the ideas from Agner's article above, shows that there is room for improvement.


Before

int brot(double x,double y,
         double rad);
#pragma aux brot =          \
"       fldz              " \
"       fldz              " \
"       mov   ecx,MI      " \
"brot1: fld   st(0)       " \
"       fmul  st,st(2)    " \
"       fadd  st,st(0)    " \
"       fadd  st,st(4)    " \
"       fxch  st(2)       " \
"       fmul  st,st(0)    " \
"       fxch              " \
"       fmul  st,st(0)    " \
"       fld   st(0)       " \
"       fadd  st,st(2)    " \
"       fcomp st(6)       " \
"       fstsw ax          " \
"       test  ah,45h      " \
"       jz    short brot2 " \
"       fsubr             " \
"       fadd  st,st(2)    " \
"       dec   ecx         " \
"       jnz   brot1       " \
"       jmp   short brot3 " \
"brot2: fstp  st(0)       " \
"brot3: fstp  st(0)       " \
"       fstp  st(0)       " \
"       fstp  st(0)       " \
"       fstp  st(0)       " \
"       fstp  st(0)       " \
parm [8087] [8087] [8087]   \
modify [8087] value [ecx];

After

static float ftmp;

int brot(double x, double y);
#pragma aux brot =             \
"       fld    st(1)         " \
"       fld    st(1)         " \
"       mov    ecx,MI        " \
"       mov    edx,40800000h " \
"brot1: fld    st(0)         " \
"       fmul   st,st(0)      " \
"       fxch   st(2)         " \
"       fld    st(0)         " \
"       fmul   st,st(0)      " \
"       fxch   st(2)         " \
"       fmulp  st(1),st      " \
"       fxch   st(2)         " \
"       fld    st(0)         " \
"       fadd   st,st(2)      " \
"       fxch   st(1)         " \
"       fsubrp st(2),st      " \
"       fxch   st(2)         " \
"       fadd   st,st(0)      " \
"       fxch   st(2)         " \
"       fstp   ftmp          " \
"       fadd   st,st(2)      " \
"       fxch   st(1)         " \
"       fadd   st,st(3)      " \
"       fxch   st(1)         " \
"       cmp    ftmp,edx      " \
"       jle    brot2         " \
"       dec    ecx           " \
"       jnz    brot1         " \
"brot2: fstp   st(0)         " \
"       fstp   st(0)         " \
"       fstp   st(0)         " \
"       fstp   st(0)         " \
parm [8087] [8087]             \
modify [8087 edx] value [ecx];

The before listing is similar to the one which exists in FracInt. Although the after listing has a much longer inner loop, it is much faster, primarily because it replaces the serializing fcomp/fstsw/test exit test with a floating point store and a plain integer compare, and uses fxch to pair FPU instructions so that more of them execute in overlap.

The loop was also changed slightly at the algorithmic level. The radius, r, which was passed in before, is now assumed to be the constant 4.0 (the value typically used to calculate the Mandelbrot set), and the loop is initialized as if it had already run one iteration.

Although I did not time this routine specifically, the overall speed of my test application went up about 40 percent (a lower bound on the improvement, since the test application has a lot of other overhead). These techniques are not being used in the current (02/11/98, v19.6) sources of FracInt; my test program can generate Mandelbrot sets 10 times faster than FracInt. (I seem to be doing somewhat better than Fractal Explorer and XaoS as well.)

Mandelbrot demo and benchmark program

There was an in-depth discussion of this loop on comp.lang.asm.x86, with good submissions by Patrick Dostert, Andrew Howe and Terje Mathisen. Patrick submitted a solution that ran with similar performance to the one I give above; however, Andrew submitted a solution that appeared to be two clocks faster, and Terje one clock faster still. Terje reasoned that his cycle count was optimal given what the algorithm had to compute. Here is Terje's loop:

; /*  Mandelbrot inner loop, optimized Pentium asm.  */
; /*  Written by Terje Mathisen 1997.                */

static float ftmp;

int brot(double x, double y);
#pragma aux brot =           \
"       mov edx,0x40800000 " \
"       mov ecx,MI         " \
"       mov ftmp,edx       " \
"       fld st(0)          " \
"       fld st(2)          " \
"brotloop:                 " /* b a x y                 */ \
"       fld st(1)          " /* a b a x y               */ \
"       fmul st,st         " /* aa b a x y              */ \
"       fld st(1)          " /* b aa b a x y            */ \
"       fmul st,st         " /* bb aa b a x y           */ \
"       fld st(1)          " /* aa bb aa b a x y        */ \
"       fadd st,st(5)      " /* aa+x bb aa b a x y      */ \
"       fxch st(4)         " /* a bb aa b aa+x x y      */ \
"       fmulp st(3),st     " /* bb aa ab aa+x x y       */ \
"       fadd st(1),st      " /* bb aa+bb ab aa+x x y    */ \
"       fsubp st(3),st     " /* aa+bb ab aa-bb+x x y    */ \
"       fxch st(1)         " /* ab aa+bb aa-bb+x x y    */ \
"       fadd st,st         " /* 2ab aa+bb aa-bb+x x y   */ \
"       cmp ftmp,edx       " /*                         */ \
"       ja  brotexit       " /*                         */ \
"       fadd st,st(4)      " /* 2ab+y aa+bb aa-bb+x x y */ \
"       fxch st(1)         " /* aa+bb 2ab+y aa-bb+x x y */ \
"       fstp ftmp          " /* 2ab+y aa-bb+x x y       */ \
"       dec ecx            " /*                         */ \
"       jnz brotloop       " /*                         */ \
"       fld st(0)          " /*                         */ \
"brotexit:                 " /*                         */ \
"       fstp st(0)         " \
"       fcompp             " \
"       fcompp             " \ 
parm [8087] [8087] modify [8087 edx] value [ecx];

This loop stood unbeaten for a reasonable amount of time, and indeed is very well designed. Notice how Terje has rotated parts of the loop around the loop boundary in order to expose more FPU scheduling opportunities. (The FPU stack is shown in the comments after each instruction; in the original color rendering, red entries denoted busy stack slots.) Indeed this loop stood long, but ...

Update: Damien Jones has submitted yet another loop that shaves one more cycle off of Terje's loop. Fundamentally, Damien observed that the cmp ftmp,edx instruction takes two clocks because of the memory load of ftmp. To deal with this he rearranged the loop to decrease the number of flds, introducing a stall between a pair of fmuls. He then inserted a mov eax,ftmp between the fmuls (so that they issue, as well as execute, two clocks apart, with the first fmul executing in parallel with the load of ftmp) and changed the compare to cmp eax,edx, which executes in one clock. This saves a clock overall. Here is Damien's loop:

; /*  Mandelbrot inner loop, optimized Pentium asm.  */
; /*  Written by Damien Jones 1997, 1998.            */

static float ftmp; 
int brot(double x, double y); 
#pragma aux brot =        \
"    mov edx,0x40800000 " \
"    mov ftmp,edx       " \
"    mov ecx,MI		" \
"    fld st(1)          " \
"    fld st(1)          " \
"brotloop:		" \
"    fld   st(0)	" /* a a b r i     	       */ \
"    fxch  st(2)	" /* b a a r i     	       */ \
"    fmul  st(2),st(0)	" /* b a ab r i 	       */ \
"    mov   eax,ftmp	" /* 			       */ \
"    fmul  st(0),st(0)	" /* bb a ab r i	       */ \
"    fxch  st(2)	" /* ab a bb r i	       */ \
"    fadd  st(0),st(0)	" /* 2ab a bb r i	       */ \
"    fxch  st(1)	" /* a 2ab bb r i	       */ \
"    fmul  st(0),st(0)	" /* aa 2ab bb r i	       */ \
"    fxch  st(1)	" /* 2ab aa bb r i	       */ \
"    fld   st(2)	" /* bb 2ab aa bb r i	       */ \
"    fsub  st(0),st(4)	" /* bb-r 2ab aa bb r i	       */ \
"    fxch  st(3)	" /* bb 2ab aa bb-r r i	       */ \
"    fadd  st(0),st(2)	" /* bb+aa 2ab aa bb-r r i     */ \
"    fxch  st(1)	" /* 2ab bb+aa aa bb-r r i     */ \
"    fadd  st(0),st(5)	" /* 2ab+i bb+aa aa bb-r r i   */ \
"    fxch  st(3)	" /* bb-r bb+aa aa 2ab+i r i   */ \
"    fsubp st(2),st(0)	" /* bb+aa aa-bb+r 2ab+i r i   */ \
"    cmp   eax,edx	" /* +'ve IEEE float compare   */ \
"    ja    brotexit	" /* cond branch	       */ \
"    fstp  ftmp		" /* aa-bb+r 2ab+i r i	       */ \
"    dec   ecx		" /* 			       */ \
"    jnz   brotloop	" /* 			       */ \
"    fld   st(0)	" \
"brotexit:		" \
"    fstp  st(0)	" \
"    fcompp		" \
"    fcompp 		" \
parm [8087] [8087] modify [8087 edx] value [ecx];

Note that this code has more fxch instructions and as such will perform worse on a 486 or K6 CPU, but overall will perform faster on Pentium, Pentium Pro and Pentium II CPUs.

Well, never say never ... Thomas Jentzsch, believe it or not, found another clock to save. The two loops above have two exit conditions, each consuming its own conditional branch. However, the dec instruction does not modify the carry flag, and the x86 has several branch instructions that test combinations of flags: ja, for example, is taken only when both the carry and zero flags are clear. Using this insight, he arranged for cmp to leave the escape test in the carry flag and dec to leave the count test in the zero flag, so a single branch instruction closes the inner loop:

; /*  Mandelbrot inner loop, optimized Pentium asm.  */
; /*  Written by Thomas Jentzsch 1998.               */

int brot2(int c);
#pragma aux brot2 =       \
"    mov edx,0x40800000 " \
"    mov eax,edx	" \
"    mov ftmp,edx	" \
"    fld ftmp		" \
"brotloop:              " /* R a b r i                 */ \
"    fstp  ftmp         " /* a b r i                   */ \
"    fld   st(0)        " /* a a b r i                 */ \
"    fxch  st(2)        " /* b a a r i                 */ \
"    fmul  st(2),st(0)  " /* b a ab r i                */ \
"    cmp   edx,eax      " /*                           */ \
"    mov   eax,ftmp     " /*                           */ \
"    fmul  st(0),st(0)  " /* bb a ab r i               */ \
"    fxch  st(2)        " /* ab a bb r i               */ \
"    fadd  st(0),st(0)  " /* 2ab a bb r i              */ \
"    fxch  st(1)        " /* a 2ab bb r i              */ \
"    fmul  st(0),st(0)  " /* aa 2ab bb r i             */ \
"    fxch  st(1)        " /* 2ab aa bb r i             */ \
"    fld   st(2)        " /* bb 2ab aa bb r i          */ \
"    fsub  st(0),st(4)  " /* bb-r 2ab aa bb r i        */ \
"    fxch  st(3)        " /* bb 2ab aa bb-r r i        */ \
"    fadd  st(0),st(2)  " /* aa+bb 2ab aa bb-r r i     */ \
"    fxch  st(1)        " /* 2ab aa+bb aa bb-r r i     */ \
"    fadd  st(0),st(5)  " /* 2ab+i aa+bb aa bb-r r i   */ \
"    fxch  st(3)        " /* bb-r aa+bb aa 2ab+i r i   */ \
"    fsubp st(2),st(0)  " /* aa+bb aa-bb+r 2ab+i r i   */ \
"    dec   ecx          " /*                           */ \
"    ja    brotloop     " /*                           */ \
"    jnc   brotexit     " \
"    add   ecx,2        " \
"brotexit:              " \
"    fstp  st(0)	" \
"    fcompp		" \
"    fcompp 		" \ 
parm [ecx] modify [8087 edx] value [ecx];

This loop is indeed the fastest yet for the Pentium; however, it is significantly slower than Damien's loop on Pentium IIs. As mentioned in Agner Fog's notes above on the P-II, this is due to a "partial flags stall": the single ja has to combine flag results written by more than one instruction (cmp and dec).

Side note on Intel's sample MMX implementation

I would just like to take a moment to point something out. On Intel's MMX developer example web pages, they have posted some code to demonstrate the advantages of using MMX code for precisely this application. I was flabbergasted by what I saw there. I don't believe they do a fair comparison; they have written code so bad as to be basically unusable for any remotely practical Mandelbrot generation. I can't understand why they have done this.

First of all, as my program demonstrates, low zooms of Mandelbrot sets are completely uninteresting, since they can be computed in less than a couple of seconds on even the slowest of Pentium based computers. However, because they chose only 16 bits of precision for their calculations (so that they can do many at once using MMX style SIMD), they will not be able to generalize their algorithm to deep zooms; they would lose accuracy very quickly. The sample image itself practically betrays the fact that the output quality of their algorithm is quite poor.

Second, the comparative FPU based code is a complete sham. There is essentially no pipelining or attempt at ordinary Pentium optimization at all. The code above (even my code) should, on a gut feel, execute between 2 and 3 times as fast as the code given in Intel's example. Note that they use FSTSW, they have unnecessary resource contentions on FPU registers, and they use FST in inappropriate ways while not once using FXCH to maximally leverage pipelining. Given the totally unpipelined way it has been written, it looks like the AMD K6 could beat the Pentium or Pentium-II on that code by as much as 50%!

I can only imagine this was shown this way because it is probably not much worse than what a typical C compiler produces, so Intel may have figured that programmers would not notice. The problem is, most programmers who can read the code and understand the issues well enough to see how it works are likely to know a thing or two about optimization, and will therefore recognize that the code shown is basically junk.

Just to make myself clear: I claim that all three routines given by Intel are garbage. The MMX one can't be used for zooms, the FPU one is poorly pipelined, and using integer code is just the wrong approach from the start. In my opinion, Intel scores 3 goose eggs.

Example 2

Continuing with my current obsession with the FPU ... There was a recent USENET posting where the author was trying to gauge the performance of the FPU's multiplication capabilities. After an obvious flaw in his first attempt was pointed out (his accumulated multiplies were causing overflow exceptions, which significantly slowed FPU performance), the following C code was arrived at:

#include <stdio.h>

int timeGetTime(void);
#pragma aux timeGetTime = ".586" "rdtsc" value [eax] modify [edx];

void DoFpMult(void) {
    int	i;
    float	val2 = (float)1.00007;
    float	val1 = (float)1.2;
    int		startTime;

    startTime = timeGetTime();

    for (i = 0; i < 1000000; i++) {
	val1 *= val2;
    }

    printf("Took %d clocks result:%f\n",timeGetTime()-startTime,val1);
}

Of course it can be optimized with a simple exp/log calculation, but as I explained above, the point of the exercise was primarily to estimate the performance of the FPU's multiplier.

According to VTUNE and the actual results the floating point version was faster than the similarly written integer (fixed point) code by only about 35%. This result appeared to hold true for both the WATCOM C/C++ compiler and the Visual C++ compiler. After disassembling the code, it was fairly clear why the C compilers were not doing better. Here is the inner loop WATCOM C/C++ gave:

L1: fld     st(1)
    inc     eax
    fmul    st,st(1)
    fstp    st(2)
    cmp     eax,1000000
    jl      L1

Although the compiler does seem to realize that it should shuffle non-pairing integer instructions in between the FPU instructions (to hide any incidental stalls that might occur between them), the problem, of course, is that the FPU stack is being shuffled around unnecessarily. Clearly the ideal inner loop looks like:

L1: fmul    st,st(1)
    dec     ecx
    jne     L1

Unrolling doesn't help, since each fmul depends on the result of the previous one, so the loop is bound by the fmul's 3-clock latency regardless. This code can be fused back into the original C code using the WATCOM C/C++ compiler as follows:

#include <stdio.h>

int timeGetTime(void);
#pragma aux timeGetTime = ".586" "rdtsc" value [eax] modify [edx];

float FPMUL(float v1, float v2, int count);
#pragma aux FPMUL =   \
" L1: fmul st,st(1) " \
"     dec ecx       " \
"     jnz L1        " \
"     fxch          " \
"     fstp st(0)    " \
value [8087] parm [8087] [8087] [ecx];

void DoFpMult(void) {
    float   val2 = (float)1.00007;
    float   val1 = (float)1.2;
    int     startTime;

    startTime = timeGetTime();

    val1 = FPMUL(val1,val2,1000000);

    printf("Took %d clocks result:%f\n",timeGetTime()-startTime,val1);
}

The mechanism for VC++ works differently, requiring you to handle all the passing from float variables to the FPU stack yourself; however, it should be possible to achieve at least the same inner loop. This loop actually doubles the performance, making it run 3 times faster than comparable integer code. While not a Pentium specific optimization, I believe that understanding how to use the floating point unit is important to using the Pentium most effectively, far more so than on previous generation x86s. The lesson here is that you should not blindly trust your C compiler, even if it advertises Pentium optimizations.


The Cyrix 6x86
Cyrix has been making slight inroads into Intel's market share with its sixth generation design. This chip has received a lot of press, as well as a lot of play on USENET, for being faster than the equivalently priced or clock rated Pentium (Pro).

Jorn Nystad has written up some notes that complement the Cyrix documentation available from their site (see Reference Links below.)


Other References


Please note: I have left off Intel P6/Klamath info, because I cannot predict the URLs at Intel's constantly changing web site. I may change my mind or try to host some P-II (aka Klamath) documentation in the future but don't hold your breath. For now, since the design is so similar, I recommend the K6 documentation, or browsing Intel's site directly for P-II information.


Updated 03/25/98
Copyright © 1996-1998, Paul Hsieh All Rights Reserved.