CPU/machine Performance

Performance of various CPUs and operating systems tends to be an emotional issue; people often seem to "believe" in their favorite CPU without thinking very much. The only issue that I consider relevant is "how fast does my code run?". This is actually an issue that tests the CPU/FPU/memory/disk/OS/compiler system as a whole (a fast CPU with a horrendously bad compiler is a pretty pointless system, for example). The best way to measure this is to actually run the code with reasonable input datasets. That is not always feasible, so we've developed a simplified benchmark code (bench.f) that simulates parts of the code. The results depend on the actual system (OS and Fortran compiler versions, etc.). The most current results for the phoenix test benchmark give a rough overview of the performance of several PHOENIX kernels. The numbers given are mega-operations per second, so larger is better.
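To make the "mega-operations per second" figure concrete, here is a minimal sketch of how such a number can be obtained from a timed kernel. It is not bench.f: the names mega_ops_per_second and toy_kernel, the operation count and the best-of-three timing are all assumptions made for illustration.

```python
import time

N_OPS = 200_000  # hypothetical operation count of the toy kernel


def toy_kernel(n=N_OPS):
    """Stand-in for a benchmark loop: one multiply-add per iteration."""
    total = 0.0
    for i in range(n):
        total += 0.5 * i
    return total


def mega_ops_per_second(kernel, n_ops, repeats=3):
    """Time kernel() several times and convert the best wall-clock time to Mops/s."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    # Guard against a zero reading from a coarse timer.
    best = max(best, 1.0e-12)
    return n_ops / best / 1.0e6


rate = mega_ops_per_second(toy_kernel, N_OPS)
print(f"toy kernel: {rate:.1f} Mops/s (larger is better)")
```

Taking the best of several repeats reduces noise from other processes; a real benchmark would also have to count the operations of each kernel carefully.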

The different sub-tests of the phoenix NLTE test benchmark are as follows. In general, the suffix "_ser" indicates serial execution of the test (one CPU) and "_par" indicates parallel execution of the test (loops) using OpenMP directives with 2-4 threads. For machines where OpenMP is not useful or functional, both results will be "identical". Currently only the IBMs have functional OpenMP modes for the benchmarks (array reduction variables are used, and most OpenMP implementations do not support them). The individual tests are:
  • ratupd: loop to simulate updating radiative rates. Tests memory bandwidth.
  • cnt_opac: simulates computation of NLTE continuous opacities.
  • voigt_opac: simulates computation of Voigt profiles for NLTE lines. Very complex loop (complex arithmetic, exp's, divisions).
  • gauss_opac: simulates computation of Gauss profiles for NLTE lines. Mostly memory bandwidth and exp's.
The test code uses Fortran90 array syntax (some compilers hate this!) as well as Fortran90 modules and allocatable arrays (bad compilers take a really nasty performance hit from these!) and is somewhat representative of the actual code used in phoenix. The "Performance Index" is a simple weighted sum of the individual serial performance numbers for 3 different and typical PHOENIX applications.
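A weighted sum of this kind can be sketched as follows. The actual weights for the three applications are not given here, so the weights and the Mops/s values below are made-up placeholders, and the function name performance_index is hypothetical.

```python
def performance_index(serial_mops, weights):
    """Weighted sum of serial benchmark results (Mops/s) for one application."""
    missing = set(weights) - set(serial_mops)
    if missing:
        raise KeyError(f"no benchmark result for: {sorted(missing)}")
    return sum(weights[name] * serial_mops[name] for name in weights)


# Placeholder numbers for illustration only, not measured results.
example_mops = {
    "ratupd_ser": 100.0,
    "cnt_opac_ser": 50.0,
    "voigt_opac_ser": 20.0,
    "gauss_opac_ser": 40.0,
}
# Hypothetical weights for one application type.
example_weights = {
    "ratupd_ser": 0.5,
    "cnt_opac_ser": 0.25,
    "voigt_opac_ser": 0.125,
    "gauss_opac_ser": 0.125,
}

index = performance_index(example_mops, example_weights)
print(f"performance index: {index}")
```

Each of the three application types would use its own set of weights, reflecting how much time that application spends in each kernel.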
Unfortunately, the simple benchmark does not accurately reflect the real-life timing behavior of the PHOENIX code. Therefore, better performance metrics are real examples of the full PHOENIX code running different sets of applications. The following sets are currently available:
All numbers are wall-clock times in seconds (smaller is better), given for different parts of the code. A graphical representation of the important data is in this chart. The machines tested are:
  • Athlon64: at 2GHz clock speed, running SuSE Linux 9.0, compiled with ifc 7
  • G4_667_PB: Apple PowerBook G4 at 667MHz, 1GB RAM, running OSX 10.3 Server, compiled with IBM xlf95 compiler
  • dG5: Apple dual G5 2GHz, 2.5GB RAM, running OSX 10.3 Server, compiled with IBM xlf95 compiler
  • Itanium2_1.3GHz: Altix 32 CPU Itanium2 at 1.3GHz, running Linux, compiled with Intel Fortran compiler v8
  • P4_2.53GHz: Intel P4 at 2.53GHz, running FreeBSD, compiled with Intel Fortran v7 (FreeBSD native executable)
  • P4_2.6GHz: Intel P4 at 2.6GHz, running Linux 2.6.1, compiled with Intel Fortran v8
  • PWR3: IBM SP nighthawk-II Power3 System, AIX 5.1, compiled with xlf95
  • PWR4: IBM Regatta Power4 System, AIX 5.1, compiled with xlf95
The results of these tests are interesting. The fastest machines are the Athlon64 and the G5; in many cases the G5 is substantially faster. This is due to the use of the Altivec unit on the G4/G5, which can deliver massive performance increases with just a few pieces of code. The exp() function is fastest on the Athlon64 and the Intel CPUs as long as they do not produce underflows. The Power3 and Power4 systems are lagging behind; however, they are still nearly 5 times faster for I/O (this does not show up in these tests). The performance of the Itanium2 is a total mystery to me: the result of the bench.f test for the Itanium2 is extremely good, but the performance of the PHOENIX runs is extremely poor. Furthermore, some tests could not run (crashed). To get optimal performance out of a G5 you must use the Altivec unit in the most time-consuming parts of the code; the speed-ups are staggering. The Athlon64/Opterons do not need special coding and have very good overall performance.
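The remark about exp() and underflows can be illustrated with a small sketch of one common workaround: skip the exp() call when the argument is so negative that the result would underflow anyway. This is not taken from PHOENIX; the name guarded_exp and the cutoff of -700 are assumptions chosen for IEEE double precision.

```python
import math

# Hypothetical cutoff: exp(-700) is roughly 1e-304, still a normal double.
EXP_FLOOR = -700.0


def guarded_exp(x, floor=EXP_FLOOR):
    """Return exp(x), but return 0.0 directly for arguments below the cutoff."""
    if x < floor:
        return 0.0
    return math.exp(x)


# Arguments like these occur far out in a line wing, where the profile is ~0.
for arg in (0.0, -1.0, -1000.0):
    print(arg, guarded_exp(arg))
```

Whether such a guard helps depends on the CPU and the math library; it trades a branch for avoiding the slow underflow path.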

Compilers

Compilers are often overlooked as being crucial for the performance of a code; this also has to include libraries and I/O subsystems. They are also important for helping to locate bugs in the code. I found that getting a code to "run" is pretty much trivial compared to getting it to run correctly, both semantically and numerically. Usually I test code on a number of different systems (CPUs, compilers, OSes, etc.) and this always detects more bugs than a single system can possibly find. A "mono-culture" of using only one type of machine/compiler is a recipe for disaster. Here are some subjective comments on the different machine/compiler combinations my group uses regularly, roughly ordered by my personal preference:
  • Apple G5s running MacOS X and xlf95: Currently the fastest combination overall. The G5 is a great system; MacOS X is basically a BSD Unix with a nice-looking frontend. IBM's xlf95 produces the fastest code and can run the code in debug mode (including catching arithmetic problems like divisions by zero!), and the VAC C++ compiler is excellent too. The NAG compiler, which is also available, works as well on the Apples as it does on other architectures; it is highly recommended for debugging and code development. The Absoft compiler runs fine on this system but has a few remaining problems. The PowerPC G5 is essentially a faster Power4 CPU; it is also a BigEndian system, which makes transferring binary files from supercomputers trivial. I currently use a PowerBook G4 (Titanium) and a dual G5 as my main development and testing machines (they can handle the 10GB input file easily!).
  • Athlon64/Opteron: Very good overall performance, for some model types faster than the G5, and overall a very good price/performance ratio. Use Intel's ifc (yes, it works on AMD64s) on those to get high-performance production code, and use NAG for development and testing. Warning: Intel's C++ compiler cannot compile the QD library correctly with anything above -O0, which puts all Intel-based machines at a severe disadvantage compared to PPCs or Power systems. g++ does compile the QD code correctly; however, its executables are currently not quite as fast, and it seems to be impossible to link g++ code with Intel ifort (version 8) code. That is a massive performance problem for all IA-32s, IA-64s and AMD64s.
  • IBM RS/6000 (or SP) machines with IBM xlf95: Excellent performance and a very good compiler. xlf95 is very strict about Fortran95 syntax (this is a good thing, you don't want some crappy compiler that accepts every weirdo and illegal extension to the standard!). It offers excellent compile-time checks on arrays, commons, subroutine calls, etc. (keep your system patched, in particular xlf and ld, to be able to use all these features). Its run-time debug options are not functioning (ironically, the OSX version of xlf95 is excellent for debugging!). xlf95 seems to have only a small number of bugs, and all of them appear to be minor.
  • NAG f95 on any platform: The NAG f95 compiler is excellent. Very good syntax checking (it is picky about the standard!), very good static and run-time debugging, very few problems. If used with an optimized version of gcc it produces code that is as fast as that of any IA-32 Fortran compiler (the urban legend that a Fortran-2-C style compiler produces slow code is just nonsense). For IA-32, I personally prefer the NAG compiler over Intel's ifc, the Absoft f95 and the Portland Group f90 compiler (the latter cannot compile f95 code and has a number of show-stopping compiler bugs even for f90 code). Intel's ifc version 7.0 produces very fast code and delivers the correct numbers; even its 'read Big Endian binary files' mode works well with the EOS tables. However, its debug mode doesn't work too well (in fact, it detects spurious errors, and some debug options crash the compiler). Beware of different versions; it seems that many patches result in an unusable compiler. Intel's compiler works under Linux, NAG's under Linux and the much better FreeBSD. I do not have much experience with Absoft f95 for IA-32 (see below though). The best overall IA-32 CPU is a complex issue: for phoenix runs, Athlons outperform Pentium IIIs by factors of 2 or better at the same clock speed when the NAG f95 compiler is used. If you use Intel's ifc, you get MUCH better performance from the Intel chips, and executables produced with ifc will easily outrun the Athlons. Therefore, I'd recommend using NAG f95 to debug the code and ifc to produce fast executables. The Pentium/ifc combination is significantly faster than the Athlons. As the operating system for IA-32 boxes we use FreeBSD, as it offers native 64bit filesystems and good NFS3 support; Linux has fallen behind the technology curve in these respects.
  • SGI Origin 2000/3000 systems: This is a nice all-around system. The compiler has some trouble delivering performance, although the MIPS CPUs are certainly fast enough. We use these machines mainly as MPI workhorses if no SPs are available.
  • DEC Alpha based systems: Fast CPU. Crappy compiler. LittleEndian. 'nuff said.
  • HPUX systems with PA-RISC processors: Slow CPUs (I have not tried more recent versions, though), slow compiler (it takes 8h to compile phoenix), very bad at handling f90 array syntax, modules and allocatable arrays. I used to have 2 ancient HPs as testbeds and "seats"; they are finally gone (they lasted close to 8 years with no failure, then eventually the disks gave out and that was it). HPUX 11 and its compilers appear to be no improvement over HPUX 10.
  • Sun UltraSPARC machines: Slow CPU, very bad compiler (it happily compiled illegal Fortran90 in some versions!). Sun changes the bit layout of structures etc. between versions, causing major grief. Nice for code testing once it compiles and links, too slow for production.
  • Itanium2: The benchmark results are very promising; however, the overall code performance on this CPU is miserable (in particular considering its staggering cost). With Intel's ifort, it is currently not possible to get working code by linking with g++ generated code; therefore, the NLTE modes of PHOENIX will either not run or run extremely slowly (due to the problems with the QD library). Furthermore, the standard
  • Do not ask me about Windows.
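The byte-order issue mentioned for the BigEndian G5 and the LittleEndian Alpha can be sketched as follows. This assumes a raw stream of 8-byte IEEE doubles with no record markers, which is a simplification: real Fortran unformatted files add compiler-dependent record framing. The function name read_big_endian_doubles is hypothetical.

```python
import struct


def read_big_endian_doubles(raw, count):
    """Decode `count` 8-byte IEEE doubles stored in big-endian byte order."""
    return list(struct.unpack(f">{count}d", raw[: 8 * count]))


# Bytes as a big-endian machine would write them.
values = [1.0, 2.5, -4.0]
raw = struct.pack(">3d", *values)

decoded = read_big_endian_doubles(raw, 3)

# Decoding the same bytes as little-endian yields garbage values.
wrong = list(struct.unpack("<3d", raw))

print(decoded)
print(wrong)
```

On a big-endian machine no conversion is needed at all, which is why moving binary files between two big-endian systems is trivial; a little-endian machine has to swap bytes explicitly or rely on a compiler option that does so.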