Feature Article
Agner Fog wrote up the results of his findings on Pentium performance in the form of an optimization tutorial, originally posted to the newsgroup comp.lang.asm.x86. It is highly detailed and, to the best of my knowledge, very accurate; I have only seen this depth of information on the subject in expensive books. In my own personal quest to know more about Pentium optimization, I realized that the biggest holes in my knowledge were cache activity, concurrent FPU operations, and branch prediction. These subjects, as well as other advanced topics, are covered here quite well. As discussed on comp.lang.asm.x86, the information here tends to be more accurate than even Intel's own documentation, because it was all derived through careful timing measurements by unbiased, independent parties (namely Agner Fog, Karki Jitendra Bahadur, and others). Intel arrived at its documentation via miscommunication between its engineers and its writers (mostly college interns) in an environment where public dissemination of information has not been taken seriously. This latest version has significant updates to the branch prediction mechanism. An HTML-ification of the file is also available; in the future, this HTML version will make better use of HTML to be more convenient to browse. It is presented here with permission. Updated 08/19/97. [PH] Update: Agner Fog has written up a more detailed account of Intel's branch prediction strategy. You can see Agner's latest article at Robert Collins' Intel Secrets site.
Examples
Example 1
What follows is a comparison between two inner loops used to calculate an iteration of the Mandelbrot set using the FPU. It is very similar to the algorithm used by the program FracInt. Although FracInt has a reputation for being a highly optimized program, a careful examination and application of the ideas of Agner's article given above shows that there is room for improvement.
The "before" listing is similar to the one in FracInt. Although the "after" listing has a much longer inner loop, it is much faster, primarily for the following reasons:
Although I did not time this routine specifically, the overall performance of my test application went up about 40 percent (a lower bound on the improvement, since the application has a lot of other overhead). These techniques are not being used in the current (02/11/98, v19.6) sources of FracInt; my test program can generate Mandelbrot sets 10 times faster than FracInt. (I seem to be doing somewhat better than Fractal Explorer and Xaos as well.)
Mandelbrot demo and benchmark program
There was an in-depth discussion of this loop on comp.lang.asm.x86, with good submissions from Patrick Dostert, Andrew Howe, and Terje Mathisen. Patrick submitted a solution with performance similar to the one I give above; Andrew's appeared to be two clocks faster, and Terje's one clock faster still. Terje reasoned that his cycle count was optimal given what the algorithm had to compute. Here is Terje's loop:
; /* Mandelbrot inner loop, optimized Pentium asm. */
; /* Written by Terje Mathisen 1997. */

static float ftmp;
int brot(double x, double y);
#pragma aux brot = \
" mov edx,0x40800000 " \
" mov ecx,MI "         \
" mov ftmp,edx "       \
" fld st(0) "          \
" fld st(2) "          \
"brotloop: "           /* b a x y */              \
" fld st(1) "          /* a b a x y */            \
" fmul st,st "         /* aa b a x y */           \
" fld st(1) "          /* b aa b a x y */         \
" fmul st,st "         /* bb aa b a x y */        \
" fld st(1) "          /* aa bb aa b a x y */     \
" fadd st,st(5) "      /* aa+x bb aa b a x y */   \
" fxch st(4) "         /* a bb aa b aa+x x y */   \
" fmulp st(3),st "     /* bb aa ab aa+x x y */    \
" fadd st(1),st "      /* bb aa+bb ab aa+x x y */ \
" fsubp st(3),st "     /* aa+bb ab aa-bb+x x y */ \
" fxch st(1) "         /* ab aa+bb aa-bb+x x y */ \
" fadd st,st "         /* 2ab aa+bb aa-bb+x x y */ \
" cmp ftmp,edx "       /* */                       \
" ja brotexit "        /* */                       \
" fadd st,st(4) "      /* 2ab+y aa+bb aa-bb+x x y */ \
" fxch st(1) "         /* aa+bb 2ab+y aa-bb+x x y */ \
" fstp ftmp "          /* 2ab+y aa-bb+x x y */       \
" dec ecx "            /* */                         \
" jnz brotloop "       /* */                         \
" fld st(0) "          /* */                         \
"brotexit: "           /* */                         \
" fstp st(0) " \
" fcompp "     \
" fcompp "     \
parm [8087] [8087] modify [8087 edx] value [ecx];
This loop stood unbeaten for a reasonable amount of time, and indeed it is very well designed. Notice how he has pulled parts of the loop around itself in order to take advantage of more FPU scheduling opportunities. (The FPU stack is shown in the comments; in the original listing, busy entries were highlighted in red.) Indeed this loop has stood long, but ...
Update: Damien Jones has submitted yet another loop that shaves one more cycle off of Terje's. Fundamentally, Damien observed that the cmp ftmp,edx instruction takes two clocks because of the memory load of ftmp. To deal with this, he rearranged the loop to decrease the number of flds, at the cost of introducing a stall between a pair of fmuls. He then inserted a mov eax,ftmp between those fmuls (so that they issue, as well as execute, two clocks apart, with the first fmul executing in parallel with the load of ftmp) and changed the cmp instruction to cmp eax,edx, which executes in one clock. This saves a clock overall. Here is Damien's loop:
; /* Mandelbrot inner loop, optimized Pentium asm. */
; /* Written by Damien Jones 1997, 1998. */

static float ftmp;
int brot(double x, double y);
#pragma aux brot = \
" mov edx,0x40800000 " \
" mov ftmp,edx "       \
" mov ecx,MI "         \
" fld st(1) "          \
" fld st(1) "          \
"brotloop: "           \
" fld st(0) "          /* a a b r i */              \
" fxch st(2) "         /* b a a r i */              \
" fmul st(2),st(0) "   /* b a ab r i */             \
" mov eax,ftmp "       /* */                        \
" fmul st(0),st(0) "   /* bb a ab r i */            \
" fxch st(2) "         /* ab a bb r i */            \
" fadd st(0),st(0) "   /* 2ab a bb r i */           \
" fxch st(1) "         /* a 2ab bb r i */           \
" fmul st(0),st(0) "   /* aa 2ab bb r i */          \
" fxch st(1) "         /* 2ab aa bb r i */          \
" fld st(2) "          /* bb 2ab aa bb r i */       \
" fsub st(0),st(4) "   /* bb-r 2ab aa bb r i */     \
" fxch st(3) "         /* bb 2ab aa bb-r r i */     \
" fadd st(0),st(2) "   /* bb+aa 2ab aa bb-r r i */  \
" fxch st(1) "         /* 2ab bb+aa aa bb-r r i */  \
" fadd st(0),st(5) "   /* 2ab+i bb+aa aa bb-r r i */ \
" fxch st(3) "         /* bb-r bb+aa aa 2ab+i r i */ \
" fsubp st(2),st(0) "  /* bb+aa aa-bb+r 2ab+i r i */ \
" cmp eax,edx "        /* +'ve IEEE float compare */ \
" ja brotexit "        /* cond branch */             \
" fstp ftmp "          /* aa-bb+r 2ab+i r i */       \
" dec ecx "            /* */                         \
" jnz brotloop "       /* */                         \
" fld st(0) "  \
"brotexit: "   \
" fstp st(0) " \
" fcompp "     \
" fcompp "     \
parm [8087] [8087] modify [8087 edx] value [ecx];
Note that this code has more fxch instructions and as such will perform worse on a 486 or K6 CPU, but overall will perform faster on Pentium, Pentium Pro and Pentium II CPUs.
Well, never say never ... Thomas Jentzsch, believe it or not, found another clock to save. The two loops above have two exit conditions, based on integer flags. However, the dec instruction does not modify the carry flag, and the x86 has numerous "combination flag" branch instructions. Using this insight, he found a way to use only one branch instruction on the inner loop:
; /* Mandelbrot inner loop, optimized Pentium asm. */
; /* Written by Thomas Jentzsch 1998. */

int brot2(int c);
#pragma aux brot2 = \
" mov edx,0x40800000 " \
" mov eax,edx "        \
" mov ftmp,edx "       \
" fld ftmp "           \
"brotloop: "           /* R a b r i */               \
" fstp ftmp "          /* a b r i */                 \
" fld st(0) "          /* a a b r i */               \
" fxch st(2) "         /* b a a r i */               \
" fmul st(2),st(0) "   /* b a ab r i */              \
" cmp edx,eax "        /* */                         \
" mov eax,ftmp "       /* */                         \
" fmul st(0),st(0) "   /* bb a ab r i */             \
" fxch st(2) "         /* ab a bb r i */             \
" fadd st(0),st(0) "   /* 2ab a bb r i */            \
" fxch st(1) "         /* a 2ab bb r i */            \
" fmul st(0),st(0) "   /* aa 2ab bb r i */           \
" fxch st(1) "         /* 2ab aa bb r i */           \
" fld st(2) "          /* bb 2ab aa bb r i */        \
" fsub st(0),st(4) "   /* bb-r 2ab aa bb r i */      \
" fxch st(3) "         /* bb 2ab aa bb-r r i */      \
" fadd st(0),st(2) "   /* aa+bb 2ab aa bb-r r i */   \
" fxch st(1) "         /* 2ab aa+bb aa bb-r r i */   \
" fadd st(0),st(5) "   /* 2ab+i aa+bb aa bb-r r i */ \
" fxch st(3) "         /* bb-r aa+bb aa 2ab+i r i */ \
" fsubp st(2),st(0) "  /* aa+bb aa-bb+r 2ab+i r i */ \
" dec ecx "            /* */                         \
" ja brotloop "        /* */                         \
" jnc brotexit " \
" add ecx,2 "    \
"brotexit: "     \
" fstp st(0) "   \
" fcompp "       \
" fcompp "       \
parm [ecx] modify [8087 edx] value [ecx];
This loop is indeed the fastest yet on the Pentium; however, it is significantly slower than Damien's loop on the Pentium II. As mentioned in Agner Fog's notes above on the P-II, this is due to a partial flags stall: the branch has to combine flag results from more than one instruction.
Side note on Intel's sample MMX implementation
I would just like to take this moment to point something out. On Intel's MMX developer example web pages, they have posted some code to demonstrate the advantages of using MMX code for precisely this application. I was flabbergasted by what I saw there. I don't believe they do a fair comparison; they have written code that is so bad as to be basically unusable for any remotely practical Mandelbrot generation. I can't understand why they have done this. First of all, as my program demonstrates, low zooms of Mandelbrot sets are completely uninteresting: they can be computed in less than a couple of seconds on even the slowest Pentium-based computers. However, because they chose only 16 bits of precision for their calculations (so that they can do many of them at once using MMX-style SIMD), their algorithm will not generalize to deep zooms; it would lose accuracy very quickly. The sample image itself practically betrays the fact that the output quality of their algorithm is quite poor. Second, the comparative FPU-based code is a complete sham. There is essentially no pipelining, and no attempt at ordinary Pentium optimization at all. On a gut feel, the code above (even my code) should execute between 2 and 3 times as fast as the code given in Intel's example. Note that they use FSTSW, they have unnecessary resource contentions on FPU registers, and they use FST in inappropriate ways while not once using FXCH to maximally leverage pipelining. Given the totally unpipelined way it has been written, it looks like the AMD K6 could beat the Pentium or Pentium II on that code by as much as 50%! I can only imagine it was presented this way because it is probably not much worse than typical C compiler output, so Intel may have figured that programmers would not notice.

The problem is, most programmers who can read the code and understand the issues well enough to follow how it works are likely to know a thing or two about optimization, and therefore to know that the code shown is basically junk. Just to make myself clear: I claim that all three routines given by Intel are garbage. The MMX one can't be used for zooms, the FPU one is poorly pipelined, and using integer code is just a bad way to do it right from the start. In my opinion, Intel scores 3 goose eggs.
Example 2
Continuing with my current obsession with the FPU ... There was a recent USENET posting in which the author was trying to gauge the performance of the FPU's multiplication capabilities. After an obvious flaw in his first attempt was pointed out (his accumulated multiplies were causing overflow exceptions, which significantly slowed the FPU), the following C code was arrived at:
#include <stdio.h>

int timeGetTime(void);
#pragma aux timeGetTime = ".586" "rdtsc" value [eax] modify [edx];

void DoFpMult(void)
{
    int i;
    float val2 = (float)1.00007;
    float val1 = (float)1.2;
    int startTime;

    startTime = timeGetTime();
    for (i = 0; i < 1000000; i++) {
        val1 *= val2;
    }
    printf("Took %d clocks result:%f\n", timeGetTime() - startTime, val1);
}
Of course it can be optimized with a simple exp/log calculation, but as I explained above, the point of the exercise was primarily to estimate the performance of the FPU's multiplier.
According to VTune and the actual results, the floating-point version was faster than the similarly written integer (fixed-point) code by only about 35%. This result appeared to hold true for both the WATCOM C/C++ compiler and the Visual C++ compiler. After disassembling the code, it was fairly clear why the compilers were not doing better. Here is the inner loop WATCOM C/C++ gave:
L1: fld  st(1)
    inc  eax
    fmul st,st(1)
    fstp st(2)
    cmp  eax,1000000
    jl   L1
Although the compiler does seem to realize that it should shuffle non-pairing integer instructions in between the FPU instructions (to hide them in the shadow of any incidental stalls that might occur there), the problem, of course, is that the FPU stack is being shuffled around unnecessarily. Clearly the ideal inner loop looks like:
L1: fmul st,st(1)
    dec  ecx
    jne  L1
Unrolling doesn't help, since the throughput of fmul is 3 clocks anyway. This code can be fused back into the original C code using the WATCOM C/C++ compiler as follows:
#include <stdio.h>

int timeGetTime(void);
#pragma aux timeGetTime = ".586" "rdtsc" value [eax] modify [edx];

float FPMUL(float v1, float v2, int count);
#pragma aux FPMUL =          \
    "L1: fmul st,st(1) "     \
    "    dec ecx       "     \
    "    jnz L1        "     \
    "    fxch          "     \
    "    fstp st(0)    "     \
    value [8087] parm [8087] [8087] [ecx];

void DoFpMult(void)
{
    float val2 = (float)1.00007;
    float val1 = (float)1.2;
    int startTime;

    startTime = timeGetTime();
    val1 = FPMUL(val1, val2, 1000000);
    printf("Took %d clocks result:%f\n", timeGetTime() - startTime, val1);
}
The mechanism for VC++ works differently, requiring you to move the float variables to and from the FPU stack yourself; however, it should be possible to achieve at least the same inner loop. This loop actually doubles the performance, making it run 3 times faster than comparable integer code. While this is not a Pentium-specific optimization, I believe that understanding how to use the floating point unit is essential to using the Pentium most effectively, far more so than on previous-generation x86s. The lesson here is that you should not blindly trust your C compiler, even if it advertises Pentium optimizations.
The Cyrix 6x86
Jorn Nystad has written up some notes that complement the Cyrix documentation available from their site (see Reference Links below.)
Other References
Please note: I have left off Intel P6/Klamath info, because I cannot predict the URLs at Intel's constantly changing web site. I may change my mind or try to host some P-II (aka Klamath) documentation in the future, but don't hold your breath. For now, since the design is so similar, I recommend the K6 documentation, or browsing Intel's site directly for P-II information.