Hi all,
sorry if this is "obvious" stuff, but I've trawled the forums and am having a hard time putting together some pieces. If someone can answer more-or-less clearly a few things, it would help me tremendously.
Basic context: I'm a sysadmin setting up a cluster to run ROMS 2.2 (possibly 3.0 at a later time). The cluster is two-way dual-core Opteron nodes with a gigabit-Ethernet interconnect on a Rocks (RedHat/CentOS) platform, using the PGI compiler to build (Intel/ifort seemed too hellish for MPI after first attempts).
I've got a "successful" (ie, it works but performance stinks) build of Roms thus,
-PGI compiler suite (latest and greatest version)
-MPICH for MPI, default config/install - compiled with PGI "by hand"
"performance stinks" means that as we add more CPUs the overall runtime *increases*. Thus, a 4-cpu job (single node) takes 30minutes with a test data set; then the same data set on 8-CPU run takes approx 60 minutes, and then 16 CPU it takes about 80-90minutes. In all cases they run as straight MPI-only job though, launched in identical manner. (Brief review of output suggests that "Halo exchange" is punishing us with the MPI scale-up? and also that 2d analysis phase in particular is suffering .. ? but alas I'm not really familiar with this, being a "sysadmin-type", not a "modeller-type")
I'm curious:
* What is the recommended MPI for best performance with ROMS 2.2? With ROMS 3.0? (One posting I've seen suggests that LAM is much better than MPICH; but it now seems LAM has been replaced by OpenMPI, and I see no mention of that anywhere. I also gather PGI may offer an integrated "tuned" MPI of some kind, not just an MPICH rebuild?)
* Is there any option (now? in the future?) for "hybrid" builds, i.e., OpenMP for SMP operation within a single SMP cluster node, but the MPI job spanning multiple nodes in the cluster? I've seen on-and-off discussion of this topic in the forum but haven't seen a clear consensus.
* If anyone feels so inclined, pointers or specific build hints for a recommended, known-working MPI/PGI build setup would certainly be **VERY** welcome. For that matter, comments on the possible benefits of migrating from ROMS 2.2 to 3.x would also not be unwelcome.
(I've tried, for example, to build my ROMS not just with MPICH but also with LAM and OpenMPI. The OpenMPI build attempts have simply failed so far with odd link/library issues(?), and my attempt to build with LAM has been semi-successful: I believe I now have a compiled binary which launches, but I'm not certain it actually works. More testing is needed on calling/launching it properly; a slight hassle since the cluster uses passwordless SSH rather than RSH, which is LAM's default. Ugh.)
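(For the LAM/ssh part specifically, my understanding is that LAM can be pointed at ssh via an environment variable before booting its daemons; the hostfile path, process count, and executable/input names below are just placeholders, not a verified recipe.)
Code:
# Tell LAM to use ssh instead of rsh, then boot the daemons and run.
# The hostfile path, process count, and the oceanM/ocean_test.in names
# are placeholders; adjust for the actual build and data set.
export LAMRSH="ssh -x"
lamboot -v ~/lam-hostfile
mpirun -np 8 ./oceanM ocean_test.in
lamhalt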
If anyone actually recommends it, I can also do builds using ifort, not just PGI, but I gather that ifort is a bit messy for building MPI ROMS(?).
And, of course, I will summarize and post back to this thread any findings or progress I have on this topic, in case it is of use or interest to others.
Many thanks,
--Tim Chipman
"Basic Groundwork"(MPI-choice/ etc) -PGI_Compiled
The poor performance could be caused by network overload...
When you run the application on just one physical node (4 processes, one per core), is there any traffic on the network? If there is network traffic while using only one physical node (4 processes in your configuration), something is wrong...
Communication among processes on the SAME physical machine should go through shared memory, not the network...
I tested several MPI implementations, and in the end I use MPICH2.
To use shared memory for communication between processes on the same physical machine, you must build the MPICH2 package, passing the configure script the option below. See page 12 ("Choosing the communication device") of the "MPICH2 Installer's Guide" on the official MPICH site for more information about this option...
Code:
--with-device=ch3:ssm
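For example, a complete build with the PGI compilers could look like the lines below; the tarball version, compiler variable names, and install prefix are only placeholders for your site, following the MPICH2 installer's guide of that era.
Code:
# Build MPICH2 with the PGI compilers and the shared-memory + sockets
# (ch3:ssm) device. The version number and --prefix are placeholders.
export CC=pgcc
export CXX=pgCC
export F77=pgf77
export F90=pgf90
tar xzf mpich2-1.0.x.tar.gz
cd mpich2-1.0.x
./configure --with-device=ch3:ssm --prefix=/opt/mpich2-pgi
make
make install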
Regards
Re: "Basic Groundwork"(MPI-choice/ etc) -PGI_Compiled
A clue:
With the Tyan S2885 (2-way, dual-core Opteron 285), ROMS is scaling very well, even using all cores per node. But I use OpenMPI + ifort (same result with PGI + OpenMPI). This has been true since I upgraded my gigabit Ethernet to InfiniBand; before that, with gigabit Ethernet, the scaling was very poor past 2 nodes.
Attention: processor and memory affinity are an issue, and if your processes ping-pong from one processor to another (check with "taskset"), that is wrong. OpenMPI does a very good job with CPU affinity and with MPI in general. Please look at the OpenMPI forum and FAQ.
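For example, you can check whether a running rank stays pinned with taskset, and ask OpenMPI to do the binding itself; the PID, process count, and executable name below are placeholders, and the MCA parameter is the one from OpenMPI 1.2-era releases, so verify it with ompi_info on your version.
Code:
# Show the CPU affinity mask of a running ROMS rank (the PID is a placeholder).
taskset -p 12345

# Let OpenMPI bind each rank to a processor (1.2-era MCA parameter;
# check "ompi_info --param mpi all" on your installation).
mpirun -np 8 --mca mpi_paffinity_alone 1 ./oceanM ocean.in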
Best regards,
Jerome Lefevre