The benchmarks ran pretty much out of the box. I only made minor changes to Compilers/Linux-ifort.mk: a few paths, and the compiler options for ifort (10.0.013 beta, the latest), for which I used:
FFLAGS += -ip -O3 -unroll0 -ftz -fno-alias -g
Note that using -g with Intel Fortran doesn't affect any of the optimizations, and serves only to keep extra symbol table information in the executable, which is excellent for debugging and profiling purposes.
The MPI version has clearly improved when compared against ROMS 2.2, and is now faster than the OpenMP version in every run but one (Benchmark2 at 1x32, where the two are within about 1%). The SGI MPI implementation (over shared memory) is pretty fast. Curiously, the MPI version does better with NtileI > NtileJ, in contrast to the OpenMP version, where NtileI = 2 does best in most test cases.
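To make the NtileI x NtileJ notation concrete, here is a rough Python sketch of how many horizontal grid points each tile gets for a given decomposition. The helper name `tile_shape` and the ceiling division are my own assumptions for illustration, not how ROMS itself computes tile bounds.

```python
import math

def tile_shape(nx, ny, ntile_i, ntile_j):
    """Approximate points per tile in the I and J directions.

    Hypothetical helper: assumes tiles are as even as possible and
    rounds up when the grid does not divide exactly.
    """
    return math.ceil(nx / ntile_i), math.ceil(ny / ntile_j)

# Benchmark1 uses a 512x64 horizontal grid on 32 processes/threads.
for ni, nj in [(1, 32), (2, 16), (16, 2), (32, 1)]:
    pts_i, pts_j = tile_shape(512, 64, ni, nj)
    print(f"{ni}x{nj}: about {pts_i} x {pts_j} points per tile")
```

A small NtileI leaves long runs in the I direction inside each tile, while a large NtileI gives short ones. That may be part of why the two versions prefer opposite decompositions, but I have not verified it.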
Here are the wallclock times in seconds for the various runs I made:
Benchmark1: 512x64x30
Decomp    MPI elapsed (s)    OpenMP elapsed (s)
1x32                29.35                 34.90
2x16                21.84                 27.94
4x8                 18.65                 28.84
32x1                16.29                100.22
8x4                 16.16                 43.61
16x2                15.14                 56.53

Benchmark2: 1024x128x30
Decomp    MPI elapsed (s)    OpenMP elapsed (s)
1x32                91.39                 90.28
2x16                70.96                 78.21
4x8                 63.70                 79.56
32x1                58.57                277.21
8x4                 57.73                123.67
16x2                54.94                178.90

Benchmark3: 2048x256x30
Decomp    MPI elapsed (s)    OpenMP elapsed (s)
2x32               157.53                178.60
4x16               135.70                171.18
2x64               100.33                118.23
8x8                120.45                263.14
32x2               116.24                474.47
16x4               115.92                382.44
4x32                81.47                103.86
8x16                70.91                138.24
16x8                66.02                197.59
64x2                64.69                387.64
32x4                61.06                250.06
6x32                58.29                117.84
8x24                55.63                117.85
12x16               52.35                138.27
16x12               52.23                152.11
64x3                51.32                314.30
48x4                50.27                250.98
24x8                49.81                168.98
32x6                49.46                200.25
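
For anyone who wants to compare the two columns more directly, here is a small Python sketch that picks the fastest decomposition for each version and prints the OpenMP/MPI time ratio. It uses only the Benchmark1 numbers from the table above. The dictionary layout and the `fastest` helper are just my own illustration.

```python
# Benchmark1 (512x64x30) wallclock seconds, copied from the table above.
# decomposition -> (MPI elapsed, OpenMP elapsed)
bench1 = {
    "1x32": (29.35, 34.90),
    "2x16": (21.84, 27.94),
    "4x8": (18.65, 28.84),
    "32x1": (16.29, 100.22),
    "8x4": (16.16, 43.61),
    "16x2": (15.14, 56.53),
}

def fastest(times, column):
    """Return the decomposition with the smallest time in the given column
    (0 = MPI, 1 = OpenMP). Hypothetical helper for this post only."""
    return min(times, key=lambda decomp: times[decomp][column])

print("fastest MPI decomposition:   ", fastest(bench1, 0))
print("fastest OpenMP decomposition:", fastest(bench1, 1))

# Ratio > 1 means the MPI run was faster for that decomposition.
for decomp, (mpi, omp) in bench1.items():
    print(f"{decomp:>5}  OpenMP/MPI = {omp / mpi:.2f}")
```

The same dictionary layout works for Benchmark2 and Benchmark3 if you paste in their rows.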