Hi, everyone.
Has anyone compared the CPU usage efficiency of ROMS in serial mode versus parallel mode?
I got some puzzling results. In serial mode the CPU efficiency is close to 99%, compared to only 33% CPU usage in parallel mode, and the parallel run takes longer than the serial one. All of the model setups are the same except for the CPU allotment.
Some information about the parallel configuration:
Resolution, Grid 01: 0059x0061x015, Parallel Nodes: 4, Tiling: 002x002
The job was submitted as:
nohup mpirun -np 4 -machinefile ~/mpi-1.2.7pgi/share/machines.LINUX ./oceanM External/ocean_bohai.in > logmpi.out &
using nodes 7, 8, 9, and 10, although node 10 was not actually used.
Is it a compiler/compilation problem or a cluster communication problem?
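A quick, hedged way to check the rank-to-node mapping (a minimal sketch assuming mpi4py is installed on the cluster; it is not part of the ROMS build) is to launch a tiny script with the same machinefile and see which host each rank reports:

    # print_ranks.py -- report which host each MPI rank runs on
    # run as: mpirun -np 4 -machinefile ~/mpi-1.2.7pgi/share/machines.LINUX python print_ranks.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    print("rank %d of %d running on %s"
          % (comm.Get_rank(), comm.Get_size(), MPI.Get_processor_name()))

If two ranks report the same host, the machinefile is oversubscribing a node, which would also explain why node 10 sits idle.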
How to increase ROMS parallel efficiency
The efficiency of ROMS in parallel mode is a strong function of 1) the grid size, 2) the interconnect speed, 3) the configuration of MPI, and 4) the configuration of the model (parameters such as NDTFAST can make a big difference).
It is hard to answer a question such as yours with no knowledge of the configuration of the model and the cluster.
I have run on the same cluster jobs with very high and very low efficiency, depending on the nature of the model run. If you are interested in pursuing this question, please post the details of the cluster, its interconnects, and your model configuration.
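If it helps to quantify the interconnect, here is a minimal ping-pong sketch (assuming mpi4py and numpy are available on the nodes; neither is part of the ROMS setup) that times a round-trip message between two ranks and gives a rough latency/bandwidth number:

    # pingpong.py -- rough point-to-point timing between rank 0 and rank 1
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    nbytes = 1 << 20                   # 1 MiB message
    buf = np.zeros(nbytes, dtype='b')
    reps = 100

    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(reps):
        if rank == 0:
            comm.Send(buf, dest=1, tag=0)
            comm.Recv(buf, source=1, tag=1)
        elif rank == 1:
            comm.Recv(buf, source=0, tag=0)
            comm.Send(buf, dest=0, tag=1)
    t1 = MPI.Wtime()

    if rank == 0:
        rtt = (t1 - t0) / reps         # seconds per round trip
        print("round trip: %.1f us, bandwidth: %.1f MB/s"
              % (rtt * 1e6, 2.0 * nbytes / rtt / 1e6))

Run it with mpirun -np 2 and the same machinefile so the two ranks land on different nodes; gigabit Ethernet will give a very different answer than a dedicated low-latency interconnect.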
Jamie
Hi, Jamie,
My configuration:
CPU/hardware: Dell PowerEdge 1850, Xeon 3.2G / 2G DDR / SCSI
Compiler location : pgi location /opt/pgi/linux86/7.0-3/;
mpi location ~/mpi-1.2.7pgi/
Operating system : Linux
CPU/hardware : i686
Compiler system : mpif90
Compiler command : mpif90
Compiler flags : -r8 -i4 -Kieee -Mdalign -Mextend
Resolution, Grid 01: 0059x0061x015, Parallel Nodes: 4, Tiling: 002x002
Physical Parameters, Grid: 01
=============================
103680 ntimes Number of timesteps for 3-D equations.
600.000 dt Timestep size (s) for 3-D equations.
40 ndtfast Number of timesteps for 2-D equations between
Activated C-preprocessing Options:
ANA_BSFLUX
ANA_BTFLUX
ASSUMED_SHAPE
AVERAGES
AVERAGES_AKS
AVERAGES_AKT
CURVGRID
DIFF_GRID
DJ_GRADPS
DOUBLE_PRECISION
EAST_FSGRADIENT
EAST_M2GRADIENT
EAST_M3NUDGING
EAST_M3RADIATION
EAST_TGRADIENT
EAST_VOLCONS
LMD_BKPP
LMD_CONVEC
LMD_DDMIX
LMD_MIXING
LMD_RIMIX
LMD_SKPP
M2CLIMATOLOGY
M2CLM_NUDGING
M3CLIMATOLOGY
M3CLM_NUDGING
MASKING
MIX_GEO_TS
MIX_GEO_UV
MPI
NONLINEAR
NONLIN_EOS
NORTHERN_WALL
POWER_LAW
PROFILE
QCORRECTION
!RST_SINGLE
SALINITY
SRELAXATION
SOLAR_SOURCE
SOLVE3D
SOUTHERN_WALL
SPLINES
SPHERICAL
SPONGE
TCLIMATOLOGY
TCLM_NUDGING
TS_U3HADVECTION
TS_C4VADVECTION
TS_DIF2
UV_ADV
UV_COR
UV_U3HADVECTION
UV_SADVECTION
UV_QDRAG
UV_VIS2
VAR_RHO_2D
VISC_GRID
WESTERN_WALL
ZCLIMATOLOGY
ZCLM_NUDGING
Thank you so much!
ZHOU
Roughly, the efficiency of the parallelization is reduced by the following:
* Slow interconnects
* Small tiles, so that there is a lot of communication (which scales as the perimeter of the tile) for each bit of internal computation in the tile (which scales as the number of grid points in the tile); see the sketch below.
Your setup has a slow interconnect, and a small model domain. You will find that the parallelization efficiency increases as your model domain becomes larger. (And with such a small problem, do you really need a parallel solution?)
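To put rough numbers on the perimeter-vs-area argument for this particular case, here is a small sketch (the halo width of 2 points is an illustrative assumption, not the exact ROMS value):

    # tile_ratio.py -- crude exchange/compute estimate for a 59x61 grid, 2x2 tiling
    nx, ny = 59, 61            # interior grid points
    ti, tj = 2, 2              # tiles in each direction
    halo = 2                   # assumed ghost-point width per exchange

    tile_nx, tile_ny = nx // ti, ny // tj          # roughly 29 x 30 points per tile
    interior = tile_nx * tile_ny                   # computation scales with tile area
    exchanged = 2 * (tile_nx + tile_ny) * halo     # communication scales with perimeter

    print("tile size        : %d x %d" % (tile_nx, tile_ny))
    print("points computed  : %d" % interior)
    print("points exchanged : %d" % exchanged)
    print("exchange/compute : %.2f" % (exchanged / float(interior)))

For tiles this small the ratio comes out near 0.27, i.e. roughly a quarter as many points cross tile boundaries as are computed inside each tile, so on a slow interconnect the ranks spend a large share of each step waiting on halo exchanges rather than computing, which is consistent with the low CPU utilization reported above.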
This is all very simplified, but a good rough start to understanding how the efficiency of the model scales with the number of processors.
Cheers,
Jamie