Notes on Performance

Author: Peter H. Hauschildt

CPU/machine Performance

Performance of various CPUs and operating systems tends to be an emotional issue; people often seem to "believe" in their favorite CPU without thinking very much. The only issue that I consider relevant is "how fast does my code run?". This tests the CPU/FPU/memory/disk/OS/compiler system as a whole (a fast CPU with a horrendously bad compiler is a pretty pointless system, for example). The best way to measure this is to actually run the code with reasonable input datasets. That is not always feasible, so we've developed a simplified benchmark code (bench.f) that simulates parts of the code. The results depend on the actual system (OS and Fortran compiler versions, etc.). The most current results for the phoenix test benchmark give a rough overview of the performance of several PHOENIX kernels. The numbers given are mega-operations per second, so larger is better.
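To make the unit concrete, here is a minimal sketch of how a mega-operations-per-second figure can be obtained: count the operations a kernel performs, time it with a wall clock, and divide. This is not bench.f; the function names and the toy kernel are assumptions chosen for illustration, and Python is used only to keep the sketch short.

```python
import time

def mega_ops_per_second(kernel, n_ops, repeats=5):
    """Return the best observed rate in mega-operations per second.

    kernel  -- hypothetical zero-argument callable that performs the work
    n_ops   -- number of operations per call, counted by hand
    repeats -- number of timed runs; the fastest one is used
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel()
        best = min(best, time.perf_counter() - start)
    # Avoid division by zero if the kernel is faster than the timer resolution.
    best = max(best, 1e-12)
    return n_ops / best / 1.0e6

# Toy kernel (an assumption, not a PHOENIX kernel): y = a*x + y,
# which is 2 floating-point operations per element.
n = 200_000
x = [1.0] * n
y = [0.5] * n

def saxpy():
    a = 2.5
    for i in range(n):
        y[i] = a * x[i] + y[i]

rate = mega_ops_per_second(saxpy, n_ops=2 * n)
print(f"{rate:.1f} mega-operations per second")
```

Taking the fastest of several runs reduces the influence of other activity on the machine. The operation count is only as good as the hand count behind it.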


The phoenix NLTE test benchmark consists of several sub-tests. In general, the suffix "_ser" indicates serial execution of the test (one CPU) and "_par" indicates parallel execution of the test (loops) using OpenMP directives with 2-4 threads. For machines where OpenMP is not useful or functional, both results will be "identical". Currently only the IBMs have functional OpenMP modes for the benchmarks (array reduction variables are used, and most OpenMP implementations do not support them). The individual tests are

The test code uses Fortran 90 array syntax (some compilers hate this!), Fortran 90 modules, and allocatable arrays (bad compilers take a really nasty performance hit from this!) and is somewhat representative of the actual code used in phoenix. The "Performance Index" is a simple weighted sum of the individual serial performance numbers for three different, typical PHOENIX applications.
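As an illustration of what "weighted sum" means here, the sketch below combines serial sub-test rates into a single index. The sub-test names, rates, and weights are all hypothetical; the actual weights used for the three PHOENIX applications are not given in these notes.

```python
def performance_index(rates, weights):
    """Weighted sum of serial benchmark rates (mega-operations per second).

    rates   -- dict mapping sub-test name to its measured rate
    weights -- dict mapping sub-test name to its weight for one application
    """
    missing = set(weights) - set(rates)
    if missing:
        raise KeyError(f"no rate for sub-test(s): {sorted(missing)}")
    return sum(weights[name] * rates[name] for name in weights)

# Hypothetical serial results and weights, for illustration only.
rates = {"kernel_a_ser": 120.0, "kernel_b_ser": 80.0, "kernel_c_ser": 200.0}
weights = {"kernel_a_ser": 0.5, "kernel_b_ser": 0.3, "kernel_c_ser": 0.2}

index = performance_index(rates, weights)
print(f"Performance Index: {index:.1f}")
```

A different set of weights per application would give one index per application from the same measured rates.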

Unfortunately, the simple benchmark does not accurately reflect the real-life timing behavior of the PHOENIX code. Better performance metrics come from running the full PHOENIX code on different sets of applications. The following sets are currently available:

All numbers are wall-clock times in seconds (smaller is better), given for different parts of the code. A graphical representation of the important data is in this chart. The machines tested are

The results of these tests are interesting. The fastest machines are the Athlon64 and the G5; in many cases the G5 is substantially faster. This is due to the use of the Altivec unit on the G4/G5, which can deliver massive performance increases with just a few pieces of code. The exp() function is fastest on the Athlon64 and the Intel CPUs as long as they do not produce underflows. The Power3 and Power4 systems are lagging behind; however, they are still nearly 5 times faster for I/O (this does not show up in these tests). The performance of the Itanium2 is a total mystery to me: the result of the bench.f test for the Itanium2 is extremely good, but the performance of the PHOENIX runs is extremely poor. Furthermore, some tests could not run (crashed). To get optimal performance out of a G5 you must use the Altivec unit in the most time-consuming parts of the code; the speed-ups are staggering. The Athlon64/Opterons do not need special coding and have very good overall performance.


Compilers

Compilers are often overlooked as being crucial for the performance of a code; the same holds for libraries and I/O subsystems. They are also important for locating bugs in the code. I found that getting a code to "run" is pretty much trivial compared to getting it to run correctly, both semantically and numerically. Usually I test code on a number of different systems (CPUs, compilers, OSs, etc.), and this always detects more bugs than a single system can possibly find. A "mono-culture" of using only one type of machine/compiler is a recipe for disaster. Here are some subjective comments on the different machine/compiler combinations my group uses regularly, roughly ordered by my personal preference: