ROMS performance characteristics and tuning


#1 Post by smulcahy

Hi,

I'm providing support on a cluster running ROMS 3.0. My background is computing and Linux rather than modelling, so apologies for any gaps in my posting.

The cluster consists of 20 Tyan S2891 boards, each holding 2 Opteron 270s (2.0GHz dual-core), for a total of 80 processing cores, with 4GB of DDR1 400MHz ECC RAM per node. 19 of the nodes are diskless and mount their filesystems over NFS from the head node. The compute nodes use a dedicated gigabit network (with an HP ProCurve 3400cl switch) to talk to each other, carrying both MPI and NFS traffic. The head node has a second dedicated gigabit connection to the public network. The operating system on the nodes is a 64-bit build of Debian GNU/Linux "Etch" with Linux kernel 2.6.17 (a Debian-packaged version).

We are using this to run a ROMS model with the following characteristics:

3D model
Tile Size: 8x10
Grid Size: 1200x750x40
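
For reference, this tiling corresponds to the NtileI/NtileJ domain-partition parameters in the ROMS input file, and their product must match the number of MPI processes (8x10 = 80, i.e. one tile per core in our case). The relevant lines, assuming the stock ROMS 3.x input-file format, look like:

    NtileI == 8        ! I-direction partition
    NtileJ == 10       ! J-direction partition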

I am currently trying to determine whether the performance of the model on this configuration can be improved through tuning. We are also in the process of preparing a proposal for a new cluster to be used for similar work. In order to specify this new cluster, I am trying to get a better understanding of the performance characteristics of ROMS in general and of our model in particular.

I have read "Poor Man's Computing Revisited" and the various performance-related postings here, but have yet to develop a clear picture of the performance characteristics of ROMS. Firstly, does the specific model being run greatly influence the performance characteristics, or are the general characteristics the same regardless of the model? Secondly, does changing the tile dimensions have much of a performance impact? We intend to perform some characterisation tests ourselves, but given the large number of variables involved, any hints on which parameters to focus on, or any existing rules of thumb, would be welcome. There are significant production demands on the cluster, so scheduling time for tuning is difficult. In my ideal world I would spend a few weeks tuning OS parameters, MPI parameters, compiler flags and model parameters, re-running the model after each change, but alas time is too limited for that.

Our ROMS model currently uses MPICH2 and is compiled with the PGI Fortran 90 compiler. We did some initial tests with LAM-MPI, but version 7.1.1 had a bug relating to the use of /tmp over NFS which precluded us from using it with our configuration. Are there significant performance differences between running ROMS over MPICH2 and over LAM-MPI? I'm wondering whether we should revisit LAM-MPI with the latest release and see whether performance and stability have improved for our configuration.

I've also seen suggestions that the workload should be tailored to fit into the L2 cache. If this rule of thumb is still applicable, can someone explain how to calculate the size of the workload?
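
My own back-of-envelope attempt is below, in case someone can correct it. It is a sketch only: I'm assuming 8-byte reals, a 2-point halo, and roughly 30 3D arrays touched per time step; that last number is a pure guess, since the real count depends on the activated options.

    # Rough per-tile working-set estimate: a sketch, not ROMS's actual
    # memory layout.  The array count is a guess.
    Lm, Mm, N = 1200, 750, 40          # interior grid points
    NtileI, NtileJ = 8, 10             # our tiling (one tile per MPI process)
    halo = 2                           # assumed ghost-point width
    tile_i = Lm // NtileI + 2 * halo   # ~154
    tile_j = Mm // NtileJ + 2 * halo   # ~79
    n_arrays = 30                      # assumed 3D arrays per time step
    working_set = tile_i * tile_j * N * n_arrays * 8
    print(working_set / 2.0**20, "MiB per tile")   # ~111 MiB

If that arithmetic is even roughly right, each tile's working set is two orders of magnitude bigger than the Opteron 270's 1MB L2, so I don't see how we could fit in cache at this grid size and process count.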

In terms of characterising the performance of ROMS on our system, I'm currently using the output from Ganglia and various hand-run tools such as iostat, htop and ifstat to get a handle on resource usage. What do others here normally use for such characterisation? I'm currently trying to determine whether our model is CPU-bound, memory-bound or interconnect-bound. I'm also trying to get a feel for how well our model will scale on a larger cluster. I've seen indications in some postings that there are limits to the number of processors a model can scale across. Can I calculate that limit from the characteristics of my model, or do I need to test with increasing numbers of processors and infer a scaling curve from those tests?
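
If the empirical route is the answer, my plan would be to time the model at a handful of processor counts and fit a simple Amdahl-style curve to extrapolate. A sketch of the idea; the timings below are made-up placeholders, not real measurements:

    import numpy as np
    from scipy.optimize import curve_fit

    # Hypothetical wall-clock times (seconds per step): substitute real runs.
    procs = np.array([8.0, 16.0, 32.0, 64.0])
    times = np.array([100.0, 55.0, 32.0, 22.0])

    def amdahl(p, t1, serial):
        # Amdahl's law: T(p) = T(1) * (serial + (1 - serial) / p)
        return t1 * (serial + (1.0 - serial) / p)

    (t1, serial), _ = curve_fit(amdahl, procs, times, p0=[700.0, 0.05])
    print("estimated serial fraction: %.3f" % serial)
    print("implied speedup ceiling: ~%.0fx" % (1.0 / serial))

Since halo-exchange communication actually grows with process count rather than staying fixed, I'd treat the fitted ceiling as optimistic.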

With regard to interconnects, as mentioned we are using gigabit ethernet. I haven't yet enabled jumbo frame support (we're using the default MTU of 1500), but am considering it as possible low-hanging fruit, if the messages we're passing around are large enough to benefit. How are others calculating the size of their MPI messages? Are there benefits in moving to higher-speed interconnects such as Myrinet or InfiniBand, or is the price/performance still best for gigabit? I've been looking at network utilisation to determine whether the interconnect is the bottleneck, but given the bursty nature of ROMS MPI traffic, my feeling is that I need more detailed analysis to determine the peak demands on the interconnect rather than the average utilisation.
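
On the message-size question, my rough estimate for a single halo exchange is below. A sketch under loudly-flagged assumptions (8-byte reals, a 2-point halo, one message per 3D field per tile edge; I haven't verified how ROMS actually packs its exchanges):

    # Rough size of one halo-exchange message: a sketch, not a trace of
    # what ROMS/MPICH2 actually puts on the wire.
    Lm, Mm, N = 1200, 750, 40
    NtileI, NtileJ = 8, 10
    halo = 2                      # assumed ghost-point width
    edge_j = Mm // NtileJ         # ~75 points along an east/west tile edge
    msg = halo * edge_j * N * 8   # one 3D field, one edge
    print(msg, "bytes per field per exchange")   # 48000
    print(msg // 1500, "frames at MTU 1500 vs", msg // 9000, "at MTU 9000")

If that is anywhere near right, each exchange runs to tens of kilobytes and fragments into dozens of standard frames, so jumbo frames look worth testing.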

Comments and suggestions welcome. I'm happy to try to summarise responses into a "Beginners' Guide to Tuning ROMS" if that's something that would be useful.

-stephen

#2 Post by smulcahy

Hi,

We've enabled jumbo frames on our cluster and have seen the performance of one of our models improve by about 30%. This may be of interest to others looking at tuning their systems.

Thanks,

-stephen


Dual quad performance

#3 Post by jivica

Hi guys, have any of you tried running ROMS on this kind of hardware?
I have built a cluster of 6 HP BL460c blades, each with two quad-core CPUs, so 6x8 = 48 cores in total, with 2GB of RAM per blade.
I have tried ROMS with ifort 10.1.x and PGI 7.0.7, and with both MPICH 1.2.7 and the latest MPICH2, over a gigabit network with an HP ProCurve 1800-24G switch.

In the end the per-CPU load is quite low for more than 16 processes; I find it hard to believe that the gigabit network alone is responsible for that.
The domain is 100x100x20, and a very similar test with 376x170x20 gives the following load:
tiles    load per CPU
6x8      ~9.5%
5x8      ~9.8%
4x8      ~22%
3x8      ~30-40%
2x8      ~70-77%
2x4      ~90%   (this run fits on one blade: 8 cores, two quad-core CPUs)
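
A back-of-envelope count of interior points per tile for these decompositions (a sketch only; it assumes the tiling splits the 376x170x20 grid evenly):

    # Interior points per tile for each decomposition: rough numbers only.
    Lm, Mm, N = 376, 170, 20
    for ni, nj in [(2, 4), (2, 8), (3, 8), (4, 8), (5, 8), (6, 8)]:
        pts = (Lm // ni) * (Mm // nj) * N
        print("%dx%d: ~%d points/tile" % (ni, nj, pts))
    # At 6x8 a tile is only ~62x21x20 points, so halo exchange is a large
    # fraction of the work per tile, consistent with the low loads above.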

So in the end it is faster to run the model on 2x8 tiles than on 3x8 or more....
What do you suggest to speed things up?
Has anyone else had this kind of experience?

Cheers,
Ivica
