Benchmarks

Post ROMS benchmark results

Moderators: arango, robertson

Post Reply
Message
Author
cbjerrum
Posts: 6
Joined: Thu Jul 10, 2003 10:34 am
Location: University of Copenhagen
Contact:

Benchmarks

#1 Unread post by cbjerrum »

Dear All

Some time back (before the Forum) there was some discussion about benchmarks for ROMS on different hardware architectures.

Is anyone currently willing to share their benchmarks with the user group?

Before I apply for computing time I would like to know what the optimal setup is, etc.

Sincerely

Christian Bjerrum

User avatar
arango
Site Admin
Posts: 1367
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University
Contact:

#2 Unread post by arango »

ROMS/TOMS has an idealized Southern Ocean benchmark. The model is configured in spherical coordinates with realistic CPP options for horizontal and vertical mixing, forcing, and the air-sea interface. No input files are needed since all the initial and forcing fields are set up with analytical expressions in analytical.F. These benchmarks can be activated by turning on either the BENCHMARK1, BENCHMARK2, or BENCHMARK3 option in cppdefs.h. The model is only run for 200 time-steps, and output is turned off to avoid IO issues. It is recommended to run each benchmark several times and average the elapsed time, since the timings can change with the computer load. Try different parallel partitions and, when possible, run these benchmarks on both shared- and distributed-memory computers.

Notice that each benchmark is of increasing size: 512x64x30, 1024x128x30, and 2048x256x30. All the benchmarks can be run with the same input script ocean_bench.in, with a time-step small enough for the finest-resolution application. The size of the time-step is irrelevant here; we just need to run the benchmarks for 200 time-steps.
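Since only the horizontal grid grows between benchmarks, each step quadruples the work. A quick sanity check (a Python sketch, not part of ROMS):

```python
# Grid sizes of BENCHMARK1, BENCHMARK2, and BENCHMARK3 (nx, ny, nz).
grids = [(512, 64, 30), (1024, 128, 30), (2048, 256, 30)]

# Total grid points per benchmark.
points = [nx * ny * nz for nx, ny, nz in grids]
print(points)  # [983040, 3932160, 15728640]

# Each benchmark has 4 times as many points as the previous one.
print(points[1] / points[0], points[2] / points[1])  # 4.0 4.0
```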

User avatar
wilkin
Posts: 922
Joined: Mon Apr 28, 2003 5:44 pm
Location: Rutgers University
Contact:

#3 Unread post by wilkin »

The following timings are for running the ROMS benchmarks on the Penguin demo 4-node dual-Opteron cluster using the PGI compiler with the native (optimized) MPI library. The tabulated values are total CPU time summed over all processors.

The tile partition indicates the horizontal domain subdivision used. The data in each tile are handled by a separate MPI process, and the solution is synchronized across the tile-boundary halo regions by MPI communication. Using more parallel nodes decreases the per-node computational load but increases the inter-node communication. Wide/long tiles (a split of 1x8) have longer vectors in the inner Fortran loop than do squarish tiles (a split of 2x4), but the total number of data points in the halo region is larger for 1x8 than for 2x4.

The three benchmarks are the same problem run for the same number of time-steps; they simply have increasing domain sizes in terms of the number of spatial grid points. The CPU effort scales almost in direct proportion to the number of grid points because there are no iterative/convergence-type operations. Therefore, BENCHMARK3 should take close to 4 times as long as BENCHMARK2 because there are 4 times as many arithmetic operations to compute.
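For the 512x64 horizontal domain of BENCHMARK1, the halo trade-off can be sketched as follows (a Python back-of-envelope with a hypothetical helper name; ROMS itself does none of this):

```python
def interior_boundary_points(nx, ny, ntile_i, ntile_j):
    """Grid points on internal tile boundaries for an nx-by-ny domain
    split into ntile_i x ntile_j tiles: each cut in I is a column of
    ny points, each cut in J a row of nx points."""
    return (ntile_i - 1) * ny + (ntile_j - 1) * nx

# BENCHMARK1 horizontal domain: 512 x 64.
print(interior_boundary_points(512, 64, 1, 8))  # 1x8 split: 3584 points
print(interior_boundary_points(512, 64, 2, 4))  # 2x4 split: 1600 points
```

So the 1x8 split exchanges more than twice the halo data of 2x4, even though its inner-loop vectors are longer.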

In practice, we see BENCHMARK2 takes about 3.5 times as long, indicating there are no obvious performance bottlenecks associated with running this large a problem on these processors. Good news. In fact, the scaling is pretty much linear all around.

The test system was 4.2 times faster than our existing 4-processor Sun machine (bildad) in terms of total CPU time, presumably due solely to processor speed. In wall clock time (when using all 4-nodes/8-processors) the test system was 7.3 times faster than bildad (4-processors) on the same problem.

Runs number 4 and higher use version 5.2 of the pgf90 compiler.

Code: Select all

            Benchmark1       |       Benchmark2      |    Benchmark3
          512 x 64 grid      |    1024 x 128 grid    |  2048 x 256 grid
        _______________________________________________________________
                             |  
tiles:  1x8   2x4  1x4  1x2  |  1x8     2x4     1x4  |  1x8     2x4
                             |
CPU seconds
-----------
run# (multiple runs of the exact same problem) 

1       2271 2332 1983       |  9585    8965    7190   34903  33594
2       2272 2511 1906       |  9544    8920           33573  33799
3       2262 2456 2089       |  9543    8818           33755
mean    2268 2433 1993       |  9557    8901    7190   34077  33696
4       1989            1408                           30096
5                      

[Bildad]*        [8300]


Wallclock seconds
-----------------

1       285  292  496           1199    1121    1739    4365   4202
2       284  315  476           1194    1116            4199   4227
3       283  307  522           1194    1103            4222   
mean    284  305  498           1196    1113    1739    4262   4215
4       249             705                             3764
5

[Bildad]*       [2077]

* Bildad is our existing 4-processor Sun

cbjerrum
Posts: 6
Joined: Thu Jul 10, 2003 10:34 am
Location: University of Copenhagen
Contact:

Benchmark - Opteron specs.

#4 Unread post by cbjerrum »

Just for reference, what interconnect was used, and what was the CPU speed, etc.?
wilkin wrote:The following timings are for running ROMS benchmark on the Penguin demo 4-node dual opteron cluster using the PGI compiler with native (optimized) MPI library. The tabulated values are total CPU time summed over all processors.

User avatar
m.hadfield
Posts: 521
Joined: Tue Jul 01, 2003 4:12 am
Location: NIWA

#5 Unread post by m.hadfield »

The following URL (only temporary, I'm afraid) gives some timings

http://www.myroms.org/links/benchmarks.htm (now permanent)

The UPWELLING case is the ROMS 1.x & 2.x upwelling test case, with no biology or diagnostics, run for 72 time steps. IO is not turned off, but it is negligible if the output files are on a local disk.

The BENCHMARK1 and 2 cases are the ones referred to by Hernan, but run for only 20 time steps--this is plenty long enough to make startup time negligible, at least on the platforms I have used.

I've never run BENCHMARK3, but it would probably fit on our T3E. I must give it a try...

rsignell
Posts: 124
Joined: Fri Apr 25, 2003 9:22 pm
Location: USGS

More BENCHMARK1 Results for various Linux & Windows

#6 Unread post by rsignell »

Here's what I've got so far for the ROMS "BENCHMARK1" test (ROMS/TOMS 2.1 - Benchmark Test, Idealized Southern Ocean Resolution, Grid 01: 0512x0064x030). All times are "wall clock" on otherwise idle systems. The tiling was 1x2, 1x4, 2x4, and 4x4 for the 2, 4, 8, and 16 processor runs, respectively.


16 cpu, 8 node, 1.5GHz/1.5MB cache Itanium-2 Altix (OpenMP, Linux) 1.45 minutes
8 cpu, 8 node, Scott Doney's 2.8 GHz Xeon cluster (MPI/MPICH, Linux,Myrinet) 2.37 minutes
8 cpu, 8 node, Rob Hetland's 2.8 GHz P4 cluster (MPI/LAM, Linux,Gigabit) 3.95 minutes
8 cpu, 4 node, Scott Doney's 2.8 GHz Xeon cluster (MPI/MPICH, Linux,Myrinet) 4.38 minutes

4 cpu, 4 node, Rob Hetland's 2.8 GHz P4 cluster (MPI/LAM, Linux,Gigabit) 7.43 minutes
4 cpu, 1 node, USGS Alpha ES40 (MPI/HP, Tru64) 9.83 minutes
4 cpu, 1 node, USGS Alpha ES40 (OpenMP, Tru64) 9.53 minutes
4 cpu, 1 node, ERDC Alpha SC40 (MPI/HP, Tru64) 9.83 minutes

2 cpu, 1 node, Sandro Carniel's Dual 1.3 GHz Itanium II (OpenMP, Linux) 10.82 minutes
2 cpu, 2 node, Rob Hetland's 2.8 GHz P4 cluster (MPI/LAM, Linux,Gigabit) 12.53 minutes
2 cpu, 1 node, John Warner's Dual 3.0 GHz Xeon (OpenMP, Cygwin) 16.56 minutes
2 cpu, 1 node, USGS Alpha ES40: (OpenMP, Tru64) 17.68 minutes
2 cpu, 1 node, Rich Signell's Dual 3.0 GHz Xeon (OpenMP, Linux) 8.53 minutes

An interesting point: it would appear that Rob Hetland's 8-cpu 2.8 GHz P4 cluster beats John Wilkin's 8-cpu Penguin Opteron cluster. What does this mean? John Wilkin, did you ever say whether your Penguin tests were with Gigabit, Myrinet, or Infiniband?

User avatar
m.hadfield
Posts: 521
Joined: Tue Jul 01, 2003 4:12 am
Location: NIWA

#7 Unread post by m.hadfield »

Here's what I've got so far for the ROMS "BENCHMARK1" test:
(ROMS/TOMS 2.1 - Benchmark Test, Idealized Southern Ocean
Resolution, Grid 01: 0512x0064x030): All time are "wall clock" on
otherwise idle systems. The tiling was 1x2, 1x4, 2x4, 4x4 for
the 2, 4, 8 and 16 processor runs, respectively.
The BENCHMARK1 domain is much larger from west to east than from south to north. Has anyone tried tilings that give squarer tiles, like 4x1, 4x2, 8x2?

Our Cray T3E works well with tile sizes down to 30x30 or slightly less, so should cope with tilings up to about 16x2 on this problem. I'll give it a try...

User avatar
shchepet
Posts: 188
Joined: Fri Nov 14, 2003 4:57 pm

#8 Unread post by shchepet »

I glanced through the whole thread of conversation and found it quite disappointing, in the sense that there is virtually no attempt to make a meaningful interpretation of the results.

Because of the split-explicit time stepping in the ROMS codes (all of them), most MPI-related latencies occur in the 2D mode, where many small messages are sent, while the 3D code has very different properties. I remember back at the fall 1998 Davis meeting Enrique and Mohammed were reporting S-shaped scaling curves for the SEOM code. ROMS, in my experience, has similar properties, with two "optima" on the performance vs. number-of-CPUs scaling curve. They are related to the balance between 2D and 3D cache utilization and message overhead.

Virtually all results reported here are very uncertain, in the sense that sub-optimal use of the code is most likely the case. For example, this excerpt from Rich Signell's message

2 cpu, 1 node, Rich Signell's Dual 3.0 GHz Xeon (OpenMP, Linux) 8.53 minutes

implies that the tiling was 1x2, as stated at the beginning of his message. (Am I interpreting it correctly?) What about rerunning this test using a 4x16 partition, still using 2 CPUs, and possibly reducing the reported time to something like 5.5 or 6 minutes?
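To illustrate the point about tiling for cache (an illustrative Python estimate only; the real working set involves many state arrays and halo rows, which are ignored here):

```python
def tile_kb_per_3d_array(nx, ny, nz, ntile_i, ntile_j, word=8):
    """Approximate KB occupied by one double-precision 3D array on a
    single tile, ignoring halo/ghost points."""
    return (nx // ntile_i) * (ny // ntile_j) * nz * word / 1024

# BENCHMARK1 grid: 512 x 64 x 30.
print(tile_kb_per_3d_array(512, 64, 30, 1, 2))   # 1x2 tiling: 3840.0 KB
print(tile_kb_per_3d_array(512, 64, 30, 4, 16))  # 4x16 tiling: 120.0 KB
```

With a 4x16 partition each tile's arrays are ~32 times smaller, so a tile has a far better chance of staying in cache while the 2 CPUs sweep through their tiles.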

Separate benchmarking of the 2D codes would be informative.

Compilers used in the test were not reported in most benchmarks.

Also, it should be noted that the easiest way to achieve good scaling of an MPI code is to degrade its "internal" single-process performance, for example by turning off compiler optimization flags: then it scales really well. A reference point should be established and reported for every MPI test case, say by running the problem on 1 CPU with the optimal partition to properly utilize its cache, using full compiler optimization and the best compiler available.

John Wilkin: in running the code on the Opteron, do you use a 32- or 64-bit mode operating system? Which compiler? (In my experience, PGF is the best for 64-bit on the Opteron, while IFC still beats it in 32-bit mode on the Opteron.)

rsignell
Posts: 124
Joined: Fri Apr 25, 2003 9:22 pm
Location: USGS

What is the best F90 compiler? Check this out!

#9 Unread post by rsignell »

As Sasha pointed out before, there is an amazing difference between F90 compilers, and picking the "best compiler" can depend on the problem you are trying to run. Proof of this is at the Polyhedron 2004 F90 Benchmarks site, where they show differences in performance on 10 different F90 programs depending on the compiler. They test 6 compilers on Intel/Windows, 5 compilers on Intel/Linux, and 9 different compilers on AMD/Linux (64-bit Opteron):

Polyhedron 2004 F90 Benchmark Results:
Intel/Windows: http://www.polyhedron.com/compare/win32 ... ch_p4.html
Intel/Linux: http://www.polyhedron.com/compare/linux ... ch_p4.html
AMD/Linux: http://www.polyhedron.com/compare/linux ... h_AMD.html

As an example of the variation in performance, for Intel/Linux on the “CAPACITA” test, the PGI compiler was 85% faster than the Intel compiler, yet for the “FATIGUE” test, the Intel compiler was 180% faster than PGI!

On the basis of the Geometric Mean for the 10 tests, the Intel compiler was the fastest for both Intel/Windows and Intel/Linux, and was only 7% slower than the fastest Pathscale 64 bit compiler on AMD/Linux. Not bad, considering this is free-for-non-commercial-use software (at least on Linux). Pathscale (64) was the most consistent compiler for AMD/Linux, being the only compiler to run all 10 tests within 50% of the fastest mark posted by any compiler. The Intel compiler won this “consistency” honor on Intel/Windows, and there was *no* compiler that had this characteristic on Intel/Linux (all compilers on Intel/Linux had at least one test that took 150% more time than the fastest).

If you are curious how your own F90 compiler measures up, you can download the benchmark suite and run them yourself!

-Rich Signell

rsignell
Posts: 124
Joined: Fri Apr 25, 2003 9:22 pm
Location: USGS

Benchmarking update & expanded info

#10 Unread post by rsignell »

To follow up the post I made last month, I've uploaded an updated spreadsheet of benchmark data for BENCHMARK1 and also our ADRIA02 Adriatic Sea run (with non-parallel I/O) to:

http://cove.whoi.edu/~rsignell/roms/bench/

Additions to this list are an 8-way Opteron Server and a Dual-Opteron cluster with Infiniband interconnectivity. The Dual-Opteron system with Infiniband was very speedy, winning all the BENCHMARK1 tests as well as our Adriatic Sea tests. For example:
BENCHMARK1 (2x4 tile) run times for 8 CPUs:
Dual-Opteron 250 cluster with InfiniBand: 1.56 mins
Dual-Xeon 2.8GHz cluster with Myrinet : 2.35 mins
Opteron 850 8-way Server : 2.46 mins
Dual-Opteron 250 with Gigabit : 4.73 mins
P4 2.8GHz cluster with Gigabit : 5.08 mins
I think it's very interesting that the dual-Opteron 250 cluster with Infiniband was roughly 3 times faster than the dual-Opteron 250 cluster with Gigabit running this benchmark.

As Sasha pointed out, the benchmark speed depends on many factors, such as the compiler used, the tiling of the domain decomposition, compiler switches, the type of MPI, etc. For MPI we are limited to specifying a number of tiles equal to the number of CPUs, so at least we have fewer degrees of freedom there. I've tried to show as many of these factors as possible in the spreadsheet available at the URL above. Sasha will probably still not be happy, but I tried...

-Rich

User avatar
arango
Site Admin
Posts: 1367
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University
Contact:

#11 Unread post by arango »

Thank you very much, Rich, for this comprehensive benchmark tabulation and all the additional information provided for each run. We just got a Dual-Opteron 250 cluster from Penguin with Gigabit interconnectivity. We are currently installing the PathScale, PGI, and Intel compilers. The InfiniBand option was more expensive, on the order of thousands of dollars, and required a special card (~$500) in each of the nodes.

Our experience with the Opterons is that they can handle the much larger benchmarks: BENCHMARK2 and BENCHMARK3.

jpringle
Posts: 108
Joined: Sun Jul 27, 2003 6:49 pm
Location: UNH, USA

dual core benchmarks

#12 Unread post by jpringle »

Has anyone used the standard benchmark cases (for example, those in Rich Signell's benchmarks above) to examine the speed of the dual-core Opteron processors that AMD has recently released?

For the problem I have in mind, I plan to be running on a 5- or 6-node Linux cluster, with each node having two processors -- essentially the Microway dual-processor Opteron cluster in Rich's benchmarks. My grid will be about 250x300x20, or so.

I was wondering if anyone has benchmarked such a cluster with dual processors each with dual cores? If not, I shall probably do so, if I can, in a fashion as similar as possible to Rich's benchmarks. However, the results might be sub-optimal. It can take me a while to learn how to optimally configure a model such as ROMS for new architectures.

The dual core chips are very pricey -- if they are going to be entirely memory bound in their packages, they might well not be worth purchasing.

Thanks for any help yall can share,
Jamie Pringle

jpringle
Posts: 108
Joined: Sun Jul 27, 2003 6:49 pm
Location: UNH, USA

Dual core/dual processor Opteron 275 benchmarks

#13 Unread post by jpringle »

Hi all--

Courtesy of Western Scientific Inc, I benchmarked ROMS 2.2 on a dual
processor, dual core Opteron 275 rig. These chips have two
processing cores per chip, so a workstation with two processor chips
will appear to Linux to have four processors.

The rig I tested had the following specs:

1) 8GB of RAM PC3200
2) 2 x Opteron 275 (2.2GHz) Dual Core Processors
3) 2 x 36GB SCSI 10k rpm 8MB Hard Drives (RAID 1)
4) Red Hat Enterprise Linux ES v4

I compiled the code with Intel FORTRAN version 8.1, with

-static -openmp -ip -O3 -pc80 -xW

(static so that I could run the code on a machine with no libraries
or compiler). The model was configured according to the problem I
am working on, with a 256 x 182 x 30 grid, and no IO to speak of. I
played with the tiling to get optimal performance, which I achieved
with a 2x12 tiling (also chosen so that the total number of tiles
is divisible by all numbers between 1 and 4). NDTFAST was 120 in my
relatively deep domain.

Note that because I used openmp, I could have multiple tiles per
processor. If MPI is used, and there is but one tile per processor,
I would expect performance to be somewhat worse.

Results: I ran the code with 1,2,3 and 4 threads. The scheduler
seemed to keep the code on separate processors when there were 2
threads. "Timestep" refers to a single baroclinic timestep, and the
associated barotropic timesteps. "Thread" refers to the number of
active threads, and thus, roughly, the number of processor cores
utilized. "Time" is the total wall-clock run-time of my benchmark run.

Code: Select all

       #Threads      Time/(timestep)   timestep/time/Thread
          1               10.8                0.092
          2                6.4                0.078
          3                4.8                0.069
          4                4.5                0.055

       On a perfect     1/Threads            Constant.
       machine
Thus for this 2.2 GHz chip, a dual-core machine should be about 40%
faster than the comparable single-core machine. However, right now,
a dual-core/dual-processor 2.2 GHz system is about 40% more
expensive than a single-core/dual-processor compute node with an
Opteron 250 (a faster 2.4 GHz chip).

Thus it would not seem that the dual core systems are cost effective
for my application at this time. Your mileage may vary.
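As a check on the numbers above (a Python sketch of the arithmetic, nothing more):

```python
# Seconds per baroclinic timestep vs. number of OpenMP threads,
# from the table above.
secs_per_step = {1: 10.8, 2: 6.4, 3: 4.8, 4: 4.5}

# Throughput per thread (the table's last column, to rounding).
for threads, t in secs_per_step.items():
    print(threads, round(1.0 / (t * threads), 3))

# Dual core vs. single core using both sockets: 2 threads vs. 4 threads.
print(round(secs_per_step[2] / secs_per_step[4], 2))  # ~1.42, i.e. ~40% faster
```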

Cheers,
Jamie

p.s. whoever invented AMD's numbering scheme for its chips should be
spanked.

p.p.s. I miss the Cray YMP, with its well balanced memory
subsystems.

jpringle
Posts: 108
Joined: Sun Jul 27, 2003 6:49 pm
Location: UNH, USA

infiniband vs. gigabit

#14 Unread post by jpringle »

Hi all--

I wondered if the varying results people get when they try to
understand the relative importance of the interconnects could have to
do with the relative importance of the 2D timestepping (lots of little
messages) vs. the 3D timesteps (fewer, larger messages). So I ran
the benchmark1 test with a variety of different values of NDTFAST and
tile arrangements on a new cluster I have from Microway. Since I run
high-resolution coastal models in a west-coast domain with a narrow
shelf, I tend to have large values of NDTFAST.

I made the tests using 8 processors on 4 dual-processor machines as a
function of tile arrangement and NDTFAST, and you can see the results
are broadly consistent with what I said before, but the differences
between gigabit and infiniband are larger. It should be noted that
these results are rather different from what one gets by running the
same test on eight machines, which for my D-Link gigabit switch
improves the relative performance of gigabit ethernet compared to
infiniband.

* ifort version 9.0
* -O3 optimization
* roms2.2 patched through August 11
* benchmark1
* mpich-1.2.7 for rsh, mvapich for infiniband
* version 1.8 of IB gold drivers
* dual opteron 250's
* D-link 24 port managed switch (DGS-1224T)
* Microway 4x Infiniband 24 Port Switch with 4x interconnect
speed, using a Mellanox chipset.


The results are given in the number of seconds it takes to run the
benchmark1 case. 8x1 indicates NtileI=8, NtileJ=1, etc, and ratio is
the ratio between the time with gigabit ethernet and the time with
infiniband.

<PRE>
interconnect  Ndtfast   8x1    4x2    2x4    1x8

gigabit          20     125    129    165    232
infiniband       20      75     75     84     92
ratio                   1.67   1.72   1.96   2.52

gigabit         100     270    280    360    471
infiniband      100     146    146    166    181
ratio                   1.85   1.92   2.17   2.60
</PRE>

Because the domain is 512x64, 8x1 gives the most balanced tile
size, and the best performance for both interconnects (for 8x1, the
tile size is 64 by 64, so there are 128 edge points. For a 1x8 tile,
the tile size is 512x8, for 520 edge points, and thus that much more
inter-node communication). As you increase the relative amount of
communication, either by increasing NDTFAST so you have more 2D
model timesteps or by increasing the edge length of the tiles,
infiniband gives you relatively better performance.
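The arithmetic in that parenthesis, sketched in Python (the helper name is my own):

```python
def tile_shape_and_edges(nx, ny, ntile_i, ntile_j):
    """Tile width, height, and the width+height count of edge points
    used above, for an nx-by-ny domain split ntile_i x ntile_j."""
    w, h = nx // ntile_i, ny // ntile_j
    return w, h, w + h

print(tile_shape_and_edges(512, 64, 8, 1))  # (64, 64, 128): squarest tiles
print(tile_shape_and_edges(512, 64, 1, 8))  # (512, 8, 520): ~4x the edge points
```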

Separately, in a relatively few runs, I found that if you run with
bigger tiles (i.e. on fewer processors or with bigger domains), the
advantage from infiniband decreases, as one would expect.

Conclusions

* infiniband is 1.7 to 2 times faster, depending on NDTFAST, for a
model domain of my size.
* the optimal tile size is the same for both kinds of interconnect.

Nothing really exciting about all this, but I hope yall find it helpful

Cheers,
Jamie

mizuta
Posts: 8
Joined: Tue Jul 15, 2003 9:01 pm
Location: Hokkaido University

old compiler problem

#15 Unread post by mizuta »

Hi.

I remember what Mark Hadfield wrote a long time ago related to benchmark tests: ROMS 2.x can be much slower than 1.8 on some "old" compilers due to the assumed-shape dummy arguments in 2.x. I am still encountering this problem and using 1.8 for my research. The compiler that I am using now (HITACHI's compiler, made for HITACHI's computers; only this compiler works on the machine I am using) does not look very old, though it may not be widely used. Is there any way to improve this situation, e.g. explicit declarations of array sizes?
Thank you.

Genta
Last edited by mizuta on Sat Oct 15, 2005 5:11 am, edited 1 time in total.

User avatar
m.hadfield
Posts: 521
Joined: Tue Jul 01, 2003 4:12 am
Location: NIWA

#16 Unread post by m.hadfield »

ROMS 2.x provides a choice between assumed-shape and explicit-shape array declarations. This is controlled with the ASSUMED_SHAPE preprocessor macro, which is set in globaldefs.h. In versions 2.0 and 2.1, ASSUMED_SHAPE is defined by default; in version 2.2 it depends on the platform.

It would be helpful if you could try your code with and without ASSUMED_SHAPE and report the results on this forum.

mizuta
Posts: 8
Joined: Tue Jul 15, 2003 9:01 pm
Location: Hokkaido University

#17 Unread post by mizuta »

Thank you, Mark. Now I have fully understood what you wrote on Nov 3, 2004.

I changed "define ASSUMED_SHAPE" in globaldefs.h to "undef ASSUMED_SHAPE".
However, ROMS slowed down even more...

I compared ROMS 1.8 and 2.1 using the UPWELLING example. Since the cppdefs.h of this example is not exactly the same between 1.8 and 2.1, I used UV_VIS2, MIX_S_UV, TS_DIF2, MIX_S_TS, and DJ_GRADPS for dissipation and pressure gradient in both versions. I also set BIO_FASHAM, DIAGNOSTICS_TS, and DIAGNOSTICS_UV in 2.1 to undef. The number of tiles is 1x1, and the number of grid points and time steps are the same in both versions.

HITACHI's compiler (on HITACHI's SR8000)

version of ROMS              Total CPU time (s)   Max. memory used (KB)
-----------------------------------------------------------------------
1.8                                104.9                 35512
2.1; define ASSUMED_SHAPE          246.7                 36448
2.1; undef ASSUMED_SHAPE           757.5                 70100
-----------------------------------------------------------------------

ROMS 2.1 is about 2.5 times slower than 1.8 when ASSUMED_SHAPE is defined (the default), and about 7.5 times slower when ASSUMED_SHAPE is undefined. It also seems that the memory used (I just borrowed this number from the outputs of my computer) may be slightly larger than expected from the grid size (41x80x16 gives a 3D double-precision array size of about 420 KB; NAT=2 for tracers).

I also tried another compiler on another machine, and verified that the difference between 1.8 and 2.1 in my UPWELLING cases can be much smaller with some compilers. (To save time, the total number of time steps was reduced from that in the above example.)

Fujitsu Fortran (on a Linux machine; AMD-A-1.2GHz CPU)

version of ROMS              Total CPU time (s)
-----------------------------------------------
1.8                                151.9
2.1; define ASSUMED_SHAPE          192.8
2.1; undef ASSUMED_SHAPE          1105.8
-----------------------------------------------

So I do not think that I used very different configurations for 1.8 and 2.1.

Are there any other things that might improve this situation?
Thank you.

User avatar
m.hadfield
Posts: 521
Joined: Tue Jul 01, 2003 4:12 am
Location: NIWA

#18 Unread post by m.hadfield »

Hi Mizuta. Those are intriguing results. I can't explain all the differences, but I have a few comments (rather long ones as it turns out)...

The first is that the UPWELLING case might not be the best one to use for benchmarking, unless it closely matches the cases you actually want to run. This is because the domain is rather small and so, consequently, is the amount of data to be transferred between CPU and RAM on every time step. You might get more relevant results with the BENCHMARK series: of these the smallest, BENCHMARK1, is the only one I can run on serial machines and is the one I tend to look at.

The substantial increase in execution time you get for the UPWELLING case between version 1.8 and 2.x is surprising. I got a 30% decrease, though I didn't match the CPP definitions very carefully (but I *did* turn off the Fasham biology in version 2). But, as I said, I don't think UPWELLING is the best benchmark.

In my opinion, it is to be expected that turning ASSUMED_SHAPE on and off will make a modest difference in execution time, but a large difference indicates there is something suspect about the Fortran compiler. The ASSUMED_SHAPE switch affects only the *style* of the array declarations and there is no fundamental reason why it should affect the way a program executes. The best explanation I can think of for the differences you see is that with explicit-shape declarations (ASSUMED_SHAPE undefined) the compiler is unable to see that it can pass array subsections by reference, so it is making temporary copies and passing them by value, with a further copy on output for intent(out) variables.

One of the reasons all this is so complicated relates to a design decision made in Fortran 90 & 95. These versions of the language allow dynamic allocation of memory for arrays with either the POINTER or ALLOCATABLE attribute. In ROMS 2 the data arrays are components of structures with names like OCEAN or GRID. The Fortran 90 and 95 rules allow only POINTER arrays to be used in this context (because the language designers wanted to limit the situations where ALLOCATABLE could be used, so it could be kept simple and easily optimised). But POINTERs are complicated beasts, and when we pass a POINTER array to a subroutine it *can* be associated with non-contiguous memory--in this situation the compiler must copy the data into a contiguous memory area for the subroutine's benefit. In fact, when ROMS passes POINTER arrays to subroutines they normally are contiguous, but some compilers (e.g. Digital Visual Fortran 5) are not smart enough to realise that.

The Fortran language committee realised there was a problem here, and in Fortran 2003 they weakened the restrictions on ALLOCATABLE arrays to allow them to be structure components. In the meantime, however, the people who make Fortran compilers also recognized the problem and improved the compilers' analysis of these situations. The result is that with a suitable compiler it is possible to replace all the POINTER declarations in ROMS with ALLOCATABLE declarations; however, there is no speed-up. (This is quite easy to test, by the way: just add "-Dpointer=allocatable" to your CPP flags in the makefile.)
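For example, the test might look like this in the build configuration (a hypothetical makefile fragment; the variable holding the CPP flags differs between ROMS makefiles):

```make
# Rewrite every lowercase "pointer" token as "allocatable" during
# preprocessing, turning the POINTER structure components into
# ALLOCATABLE ones (needs a compiler that allows allocatable
# structure components, a Fortran 2003 feature).
CPPFLAGS += -Dpointer=allocatable
```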

Anyway, I digress. You might want to re-try your tests with BENCHMARK1. By all means, report them on the forum. Then set ASSUMED_SHAPE to whatever works best. (In ROMS 2.2, preprocessor macros describing the Fortran compiler system, OS and hardware are all available, so this can be done with platform-specific code in globaldefs.h.) And if ROMS 2 remains much slower than ROMS 1 then I'm sorry, but this probably indicates that your compiler is not doing a good job.

By the way, it is irksome to have all those array declarations duplicated in the code and I know Hernan would like to remove them. But there are still some platforms on which ROMS works best with ASSUMED_SHAPE on and others on which it works best with ASSUMED_SHAPE off, so I guess we're stuck with them.

mizuta
Posts: 8
Joined: Tue Jul 15, 2003 9:01 pm
Location: Hokkaido University

#19 Unread post by mizuta »

I just got back from other works.

Thank you for your comments, Mark. I learned a lot of new things about ROMS 2x.

I tried BENCHMARK1 instead of UPWELLING and got results that differ from those for the UPWELLING example. For BENCHMARK1, ROMS 2.1 is about 1.3 times faster than 1.8 with HITACHI's compiler. However, I found that this is just due to the difference in cppdefs. If I change the cppdefs.h of BENCHMARK1, then ROMS 2.1 is again about 2 times slower than 1.8. What seems to matter are the LMD_MIXING and BULK_FLUXES options. Since I use a quite idealized model configuration for my research, I use simpler diffusion and surface boundary conditions. It seems that ROMS 2.1 becomes slower than 1.8 in such simple configurations with HITACHI's compiler.

I wonder whether or not such things happen with other compilers.

BENCHMARK1 on HITACHI SR8000 (50 time steps)

Version of ROMS     CPU time (s)
--------------------------------
1.8                     89.2
2.1                     67.6

In this example ROMS 2.1 is faster than 1.8. A detailed elapsed-time profile shows that the three regions that consume the most CPU time are "Atmosphere-Ocean bulk flux" (Atmos-Ocean), "KPP vertical mixing" (KPP param), and "Model 2D kernel" (2D kernel) in both ROMS 2.1 and 1.8. (To estimate these values I added the same clock routines as in ROMS 2.1 to 1.8.)


BENCHMARK1 on HITACHI SR8000 (50 time steps)

Region Name      ROMS 1.8: CPU time s (%)   ROMS 2.1: CPU time s (%)
--------------------------------------------------------------------
Atmos-Ocean          44.6  (50.0)               17.2  (25.4)
KPP param            29.0  (32.6)               24.3  (35.9)
2D kernel             4.3  ( 4.8)                7.7  (11.4)

Total                89.2 (100.0)               67.6 (100.0)


The first two regions consume more than half of the total CPU time, and these regions run faster in ROMS 2.1 than in 1.8. This is quite different from the UPWELLING example, where about half of the CPU time is consumed by the "Model 2D kernel": the top three time-consuming regions are "Model 2D kernel", "3D equations predictor step" (3D predic), and "Pressure gradient" (Pressure) for ROMS 1.8, and "Model 2D kernel", "Processing of output time averaged data" (Time avg), and "Corrector time-step for 3D momentum" (Corr 3D mom) for ROMS 2.1.


UPWELLING on HITACH SR8000

Region Name      ROMS 1.8: CPU time s (%)   ROMS 2.1: CPU time s (%)
--------------------------------------------------------------------
2D kernel            60.9  (58.1)              102.7  (41.6)
3D predic             8.5  ( 8.1)               13.8  ( 5.6)
Pressure              6.5  ( 6.2)                8.2  ( 3.3)
Corr 3D mom           4.4  ( 4.2)               24.8  (10.0)
Time avg              0.6  ( 0.5)               48.3  (19.6)

Total               104.9 (100.0)              246.6 (100.0)

The 2D kernel and other basic parts of the 3D equations consume most of the CPU time for both ROMS 1.8 and 2.1, and they are faster in the former. It is curious that the time-averaging region takes so long in ROMS 2.1. Is this related to the ALLOCATE statements used there?

I changed cppdefs.h of the BENCHMARK1 example (MYBENCHMARK1) and verified that ROMS 2.1 is 1.7 times slower than 1.8, even for larger grids, in a simple configuration. (The difference from the UPWELLING example is attributed to the time-averaging region, which is not included in this new example.) In this example the 2D kernel again consumes most of the CPU time.


MYBENCHMARK1 on HITACHI SR8000 (500 time steps)

Region Name      ROMS 1.8, s (%)     ROMS 2.1, s (%)
----------------------------------------------------
2D kernel          40.1  (28.2)        76.7  (32.6)
3D predic          25.7  (18.1)        13.8  ( 5.6)
Corr 3D t(*1)      13.6  ( 9.6)        24.1  (10.3)
Corr 3D mom        12.3  ( 8.7)        28.7  (12.2)

Total             142.1 (100.0)       234.9 (100.0)


(*1) Abbreviation of "Corrector time-step for tracers".
(*2) In cppdefs.h the following options that were defined in BENCHMARK1
were changed from define to undef:

LMD_MIXING, BULK_FLUXES, SOLAR_SOURCE, ALBEDO, ANA_SRFLUX

The following options were newly defined in MYBENCHMARK1:

ANA_SMFLUX, ANA_BTFLUX, ANA_BSFLUX, ANA_STFLUX, ANA_SSFLUX.

Note that the total number of time steps was increased from 50 to 500 to get enough time resolution. Everything else was kept the same as in BENCHMARK1.

So, basically, I will have to wait for HITACHI's compiler to become smart enough to treat ALLOCATE and POINTER statements efficiently before using ROMS 2.1.

I have verified that the 2D kernel slows down for ROMS 2.1 with the Fujitsu compiler on my PC as well, though not as drastically as with HITACHI's compiler. Has anyone experienced the same thing with other compilers? I am slightly concerned that there is no evidence that this does not happen at all with other compilers.

User avatar
m.hadfield
Posts: 521
Joined: Tue Jul 01, 2003 4:12 am
Location: NIWA

#20 Unread post by m.hadfield »

Sacha's recent post

viewtopic.php?p=613&highlight=#613

highlights another issue that affects performance on some compilers but not others: the automatic allocation and deallocation of memory required to support automatic arrays.

Here, by the way, is the definition of an automatic array:

An automatic array is an explicit-shape array that is a local variable. Automatic arrays are only allowed in function and subroutine subprograms and are declared in the specification part of the subprogram. At least one bound of an automatic array must be a nonconstant specification expression [otherwise it's an ordinary local, explicit-shape array, for which storage can be allocated statically]. The bounds are determined when the subprogram is called.

mizuta
Posts: 8
Joined: Tue Jul 15, 2003 9:01 pm
Location: Hokkaido University

#21 Unread post by mizuta »

Thank you, Mark. I learned another new thing: what is regarded as bad programming practice in Fortran 77 is not necessarily so in Fortran 90. If I change the 2D kernel of ROMS 2.x so that it uses the older technique for work arrays, I may get better performance from HITACHI's compiler. Let me try this. But since our machine is down for a month of maintenance from December, I may not be able to do this very soon... (HITACHI's compiler may become slightly wiser after the maintenance, but I am not so optimistic about this.) Anyway, thanks again for your valuable comments.

User avatar
m.hadfield
Posts: 521
Joined: Tue Jul 01, 2003 4:12 am
Location: NIWA

Assumed-shape vs explicit-shape

#22 Unread post by m.hadfield »

In an earlier message in this thread I wrote:
In my opinion, it is to be expected that turning ASSUMED_SHAPE on and off will make a modest difference in execution time, but a large difference indicates there is something suspect about the Fortran compiler.
I decided to look into this further by running the BENCHMARK1 case in serial mode, varying the compiler, the ASSUMED_SHAPE setting, and the tiling. I thought some of you might be interested, and somewhat surprised, by the results... (All tests on a Pentium Xeon 2.4 GHz running Linux, otherwise lightly loaded. Code is ROMS 2.2 with recent updates. All settings standard unless otherwise indicated.)

Code: Select all

TILING  ASSUMED  EXPLICIT

Intel Fortran 8.1 (-static -ip -O3 -pc80 -xW)

1 × 1     1230      1240
1 × 8     1020      1450
2 × 4      990      1455
2 × 8      930      1890
4 × 8      915      2890
8 × 8      990      5066
8 × 1     1140      1540

G95 Oct  3 2005 (-O3 -ffast-math)

1 × 1     1670      1565
1 × 8     1565      1430
2 × 4     1495      1375
2 × 8     1520      1390
4 × 8     1520      1300
8 × 8     1460      1315
8 × 1     1495      1520

Gfortran 20051101 (-ff2c -frepack-arrays -O3 -ffast-math)

1 × 1     1505    compiler
1 × 8     1390      error
2 × 4     1340    
2 × 8     1375    
4 × 8     1350    
8 × 8     1400    
8 × 1     1495    
First, as has been observed previously, the commercial compiler (Intel Fortran) is significantly faster than the open-source ones (G95 & Gfortran). Second, on the Xeon processor times usually improve as the number of tiles increases and the optimum tiling (4 x 8) has tiles that are much wider than they are tall, bearing in mind that the BENCHMARK domain is wide (512 x 64 for BENCHMARK1). But the remarkable thing is the way the Intel Fortran compiler bucks the second trend with explicit-shape declarations. For this combination, execution time increases rapidly with number of tiles and for 8 x 8 reaches a whopping 5066 seconds! Go figure.

User avatar
m.hadfield
Posts: 521
Joined: Tue Jul 01, 2003 4:12 am
Location: NIWA

#23 Unread post by m.hadfield »

Somehow, "four times eight" in the previous message came out as "four times smiley". Further evidence that computer software is willfully perverse :)

mizuta
Posts: 8
Joined: Tue Jul 15, 2003 9:01 pm
Location: Hokkaido University

#24 Unread post by mizuta »

This is not a direct answer, but I wonder if it may be a hint for someone else. How does the commercial Intel Fortran compiler use CPUs? In my understanding one Xeon processor has two CPUs in it (is this correct?). On our HITACHI machine, which is an extreme example, 1x1 tiling always gives the best performance. The processor of our machine has multiple CPUs, and HITACHI's compiler automatically tries to divide the job given by a program evenly over all CPUs, so I think it prefers a job that consists of one big piece. But I do not have a good explanation for the difference between explicit and assumed shapes...

User avatar
m.hadfield
Posts: 521
Joined: Tue Jul 01, 2003 4:12 am
Location: NIWA

#25 Unread post by m.hadfield »

Hi Mizuta

The machine I used does have 2 CPUs, but the numbers I quoted were all for a model run in serial mode, so it used only one of them. There were no other CPU-intensive processes on the machine. It is strange but true that on Intel x86-family CPUs a serial run will normally go faster if there are multiple tiles.

ce107
Posts: 10
Joined: Tue Jul 01, 2003 10:31 am
Location: MIT,EAPS

#26 Unread post by ce107 »

m.hadfield wrote:It is strange but true that on Intel x86-family CPUs a serial run will normally go faster if there are multiple tiles.
Actually, one would expect this type of behaviour on all cache-based processors (possibly including cache-based vector processors like the later Cray ones, though things could be different there in terms of the memory-reference-to-flop balance). A multiple-tile scenario improves the chance of the working set fitting in cache, at the expense of some extra inter-tile communication. The optimal tile shape depends on the cache, the processor's FPU pipeline, and the compiler's optimization level and capabilities.

Constantinos

User avatar
kate
Posts: 4091
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

#27 Unread post by kate »

It is once again time for benchmarking here at ARSC. I know we've always run the BENCHMARK problems without any I/O to get at processor speeds. I want to point out that this was actually misleading on the Cray X1 (which DoD has asked us to consider retiring sooner rather than later). Once we managed to get the model physics to parallelize, ROMS ran quite respectably on the Cray. But doing a more typical model configuration, with daily outputs of history/average/restart, caused the code to slow down drastically: it turned out I was asking the netCDF library to do double-to-single conversions to save disk space, and that one conversion routine was running in scalar mode and dominating the run time. Hence I'm running on the IBM. Anyway, I'm thinking I might just run BENCHMARK3 with and without I/O, and maybe even try it with DIAGNOSTICS. I like BENCHMARK3 because it is large enough to require 64 bits in serial mode, much like our real applications.

Anyone heard of ClearSpeed coprocessors for Linux Opteron systems? They seem great if you are doing Linpack or FFT's, but I don't know that we can take advantage of them, short of rewriting ROMS in their dialect of vector C.

smulcahy

#28 Unread post by smulcahy »

shchepet wrote:A reference point should be established and reported for every MPI test case, say running the problem using 1 CPU and optimal partition to properly utilize its cache, and using full compiler optimization and the best compiler available.
Hi Alexander,

I've been reading this thread trying to understand how to characterise the performance of ROMS. Are you suggesting that the performance of ROMS jobs is optimal when the working set fits into the CPU cache? If so, which cache are you referring to, L1 or L2?

If this is the case, what can be done to tune the size of the working set? What model parameters can be modified to change the size of the working set - tile size?

How does one calculate the size of the working set of a ROMS job (if that is the correct term to use; apologies if I'm abusing terminology)? Is it simply the following, or is there additional data I'm omitting?

i × j × no. of model fields × depth (if 3D) × field size

Thanks,

-stephen

User avatar
shchepet
Posts: 188
Joined: Fri Nov 14, 2003 4:57 pm

#29 Unread post by shchepet »

Steve,

I did not see your message until recently. The point of my post about establishing a 1-CPU reference is to evaluate the performance of the "internal code", which is consistently overlooked, not only in the ROMS discussions on this board but in the wider community as well: people tend to be happy with "scaling" or "parallel speedup" on multiple CPUs, but the easiest way to achieve that is to slow down execution within each processor.

The "cache use" optimization refers to main-memory-to-L2 reads and writes, L2-to-L1 transfers, and L1-to-register transfers. It is generally believed that L2 cache misses are the most penalizing.

Consider a 3.2 GHz Pentium 4 with an 800 MHz FSB (hence a 200 MHz memory clock and a dual-channel DDR memory system). If the CAS latency is 2.5 (typical for both mainstream and ECC memory), then each stall due to an L2 miss translates into 40 wasted clock cycles (clock multiplier of 16 times 2.5 CAS latency).


L1-to-register optimization matters because a modern CPU can typically execute one multiply-add operation per clock cycle, and load or store one double-precision number per cycle. This means that
Code: Select all

    do i=1,10000
        a(i)=b(i) + c(i)*d(i)
    enddo
can reach only 1/4 of the theoretical floating-point speed of the processor even if all the relevant data resides in L1 cache. This is because each iteration of this loop executes only one multiply-add, but it must also perform 4 loads/stores.

In practice this means that you have to increase the computational density (the ratio of flops to loads/stores in the loop), and this can be achieved by structuring your code to fuse consecutive loops as much as you can and to eliminate intermediate arrays whenever possible.
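A sketch of the loop-fusion idea, in C for concreteness (the function and array names are made up for illustration): the unfused version stores an intermediate array to memory and reads it back, while the fused version keeps the intermediate value in a register, raising the flop-to-memory-op ratio from 3/6 to 3/4.

```c
#include <stddef.h>

/* Unfused: two loops; the intermediate array tmp is stored, then
 * re-loaded. Per iteration pair: 3 flops, 4 loads, 2 stores. */
void saxpy_scale_unfused(double *a, const double *b, const double *c,
                         double *tmp, size_t n)
{
    for (size_t i = 0; i < n; i++)
        tmp[i] = b[i] * c[i];           /* 1 mul, 2 loads, 1 store */
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] + 2.0 * tmp[i];     /* 1 mul-add, 2 loads, 1 store */
}

/* Fused: one loop; the intermediate product stays in a register.
 * Per iteration: 3 flops, 3 loads, 1 store. */
void saxpy_scale_fused(double *a, const double *b, const double *c,
                       size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] + 2.0 * (b[i] * c[i]);
}
```

Both versions compute the same result; the fused one also avoids the memory footprint of tmp entirely, which is exactly the "eliminate intermediate arrays" advice above.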

The cache effects in ROMS can be illustrated by playing with the partitioning and comparing run times for different NSUB_X and NSUB_E. This is described in "Poor Man's Computing". But, as it is said there, "compromise is very compromising": tiling can improve things only up to a certain limit. If the tiling is too fine, the loops are short, resulting in a performance loss (pipelining creates wind-up, steady-state, and wind-down phases for each innermost loop, and optimum performance is reached only in the steady state), and the fraction of redundant computations on the edges of the tiles also increases.

...This is all theory, of course. To get some sense of it, try to execute the simplest example, the 2D SOLITON case, setting

Code: Select all

    grid size     384 x 128
    time step     0.005
    NTIMES        2400
    basic second-order numerics
    no viscosity (just undefine the relevant CPPs)
(keep the above the same for all experiments) and play with the partitions

Code: Select all

      NtileI = 2             in UCLA/AGRIF the analogous
      NtileJ = 22            parameters are NSUB_X and NSUB_E
using a single or dual CPU (more if you have them),

and defining or undefining ASSUMED_SHAPE in the global definitions.

Then compare the above against NtileI=1, NtileJ=1 (single CPU only), as well as NtileI=1, NtileJ=2 (single and dual).

See how it goes and report back what you find.

MJiang
Posts: 21
Joined: Mon Apr 18, 2005 6:41 pm
Location: Florida Atlantic University

#30 Unread post by MJiang »

hi Rich,

I can't find your benchmark in your earlier post. Could you update the link or send it to me directly? I am trying to test OpenMP with gfortran but am getting a strange result: the wall-clock time with OpenMP is actually longer than for a serial run. Here is an outline of my run:

System: AMD Opteron x86_64, dual 2.2GHz processors, Fedora Linux, 2GB RAM
Compiler: GFortran 4.3 (20071117 experimental)
Test case: BENCHMARK2, medium resolution 1024x128, tiling 2x2
Netcdf 3.6.2
Compiling option: -m64 -frepack-arrays -fopenmp -O3 -ffast-math -ffree-form -ffree-line-length-none

The serial build uses the same options except with OpenMP turned off.

The result: 4248 (1 CPU) vs 4860 (2 CPUs).

I also tried tiling 8x1; it did not help. The test with the UPWELLING case gave a similar comparison. I am wondering if I did anything wrong, or is this due to gfortran?

Mark,

when you did your GFortran & OpenMP tests, did you compare them with serial runs?

thanks.

Mingshun

User avatar
m.hadfield
Posts: 521
Joined: Tue Jul 01, 2003 4:12 am
Location: NIWA

#31 Unread post by m.hadfield »

Times reported by ROMS with GFortran/OpenMP are broken, I think. I haven't established why.

stef
Posts: 192
Joined: Tue Mar 13, 2007 6:38 pm
Location: Independent researcher
Contact:

Re: Benchmarks

#32 Unread post by stef »

(Attached figure: met_c48xlarge.png)
Hi, I'm tinkering with benchmarks on Amazon Web Services (AWS). The attached figure shows the time (in seconds) spent per process for the ROMS "large" benchmark test (benchmark3.in), as a function of the number of processes. Axes are logarithmic with base 2. Computations were performed on c4.8xlarge instances of AWS, which have 36 vCPUs per node. I used only 32 vCPUs of the c4.8xlarge to avoid potential problems with some Linux operating systems which have a vCPU limit of 32. These instances feature custom Intel Xeon E5-2666 v3 (Haswell) processors.

AWS defines a "vCPU" as a hyperthread of the Intel Xeon processor. Note that a stock Xeon E5-2666 v3 has 10 cores and 2 threads per core, but AWS uses a "custom" version.

Software stack:

Operating system: CentOS Linux release 7.2.1511
Linux kernel: v3.10.0 x86_64
Fortran compiler: gcc-gfortran 4.8.5
MPI library: openmpi 1.10.0

More info, including a tentative cost estimate for a realistic study, can be found at https://poidl.github.io/awsroms/. Let me know if you find mistakes.

Post Reply