How to increase ROMS parallel efficiency

Report or discuss software problems and other woes

Moderators: arango, robertson

FengZhou
Posts: 52
Joined: Wed Apr 07, 2004 10:48 pm
Location: 2nd Institute of Oceanography, SOA

How to increase ROMS parallel efficiency

#1 Unread post by FengZhou »

Hi, everyone.

Has anyone compared ROMS CPU efficiency in serial mode versus parallel mode?

I got surprising results. In serial mode the CPU efficiency is close to 99%, compared to only about 33% per CPU in parallel mode, and the parallel run actually takes longer than the serial one. The model setup is identical in both cases except for the CPU allotment.

Some information about the parallel configuration:
Resolution, Grid 01: 0059x0061x015, Parallel Nodes: 4, Tiling: 002x002

The job was submitted as:
nohup mpirun -np 4 -machinefile ~/mpi-1.2.7pgi/share/machines.LINUX ./oceanM External/ocean_bohai.in > logmpi.out &

using nodes 7, 8, 9, and 10, although node 10 was not actually used.
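
For reference, ROMS reads the tiling from the input file, and in MPI runs NtileI * NtileJ must equal the number of processes given to mpirun. A minimal sketch of the relevant lines (assuming the standard NtileI/NtileJ keywords in External/ocean_bohai.in, with the values reported above):

      NtileI == 2        ! I-direction partition
      NtileJ == 2        ! J-direction partition

so "mpirun -np 4" is consistent with the 002x002 tiling.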

Is this a compiler/compilation problem or a cluster communication problem?

jpringle
Posts: 108
Joined: Sun Jul 27, 2003 6:49 pm
Location: UNH, USA

#2 Unread post by jpringle »

The efficiency of ROMS in parallel mode is a strong function of 1) the grid size, 2) the interconnect speed, 3) the configuration of MPI, and 4) the configuration of the model (parameters such as NDTFAST can make a big difference).

It is hard to answer a question such as yours with no knowledge of the configuration of the model and the cluster.

I have run on the same cluster jobs with very high and very low efficiency, depending on the nature of the model run. If you are interested in pursuing this question, please post the details of the cluster, its interconnects, and your model configuration.

Jamie

FengZhou
Posts: 52
Joined: Wed Apr 07, 2004 10:48 pm
Location: 2nd Institute of Oceanography, SOA

#3 Unread post by FengZhou »

Hi, Jamie,

My configuration:

CPU/hardware: Dell PowerEdge 1850, Xeon 3.2G / 2G DDR / SCSI
Compiler (PGI) location : /opt/pgi/linux86/7.0-3/
MPI location : ~/mpi-1.2.7pgi/

Operating system : Linux
CPU/hardware : i686
Compiler system : mpif90
Compiler command : mpif90
Compiler flags : -r8 -i4 -Kieee -Mdalign -Mextend

Resolution, Grid 01: 0059x0061x015, Parallel Nodes: 4, Tiling: 002x002

Physical Parameters, Grid: 01
=============================

103680 ntimes Number of timesteps for 3-D equations.
600.000 dt Timestep size (s) for 3-D equations.
40 ndtfast Number of timesteps for 2-D equations between each 3-D timestep.

Activated C-preprocessing Options:

ANA_BSFLUX
ANA_BTFLUX
ASSUMED_SHAPE
AVERAGES
AVERAGES_AKS
AVERAGES_AKT
CURVGRID
DIFF_GRID
DJ_GRADPS
DOUBLE_PRECISION
EAST_FSGRADIENT
EAST_M2GRADIENT
EAST_M3NUDGING
EAST_M3RADIATION
EAST_TGRADIENT
EAST_VOLCONS
LMD_BKPP
LMD_CONVEC
LMD_DDMIX
LMD_MIXING
LMD_RIMIX
LMD_SKPP
M2CLIMATOLOGY
M2CLM_NUDGING
M3CLIMATOLOGY
M3CLM_NUDGING
MASKING
MIX_GEO_TS
MIX_GEO_UV
MPI
NONLINEAR
NONLIN_EOS
NORTHERN_WALL
POWER_LAW
PROFILE
QCORRECTION
!RST_SINGLE
SALINITY
SRELAXATION
SOLAR_SOURCE
SOLVE3D
SOUTHERN_WALL
SPLINES
SPHERICAL
SPONGE
TCLIMATOLOGY
TCLM_NUDGING
TS_U3HADVECTION
TS_C4VADVECTION
TS_DIF2
UV_ADV
UV_COR
UV_U3HADVECTION
UV_SADVECTION
UV_QDRAG
UV_VIS2
VAR_RHO_2D
VISC_GRID
WESTERN_WALL
ZCLIMATOLOGY
ZCLM_NUDGING


Thank you so much!

ZHOU

FengZhou
Posts: 52
Joined: Wed Apr 07, 2004 10:48 pm
Location: 2nd Institute of Oceanography, SOA

#4 Unread post by FengZhou »

About the interconnect: because of a mismatch between mpi-1.2.7 and the Myrinet 2000 hardware, I have to fall back to a slower 1000 Mbps (Gigabit Ethernet) hub.

qwang
Posts: 1
Joined: Tue Sep 28, 2004 12:03 am
Location: The University of British Columbia

#5 Unread post by qwang »

If you write output frequently, it is possible for the parallel run to take longer than the serial one.
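
The write frequency is set by the output intervals in the input file; a minimal sketch (assuming the standard ROMS keywords, with illustrative values only, not taken from this run):

      NHIS == 144        ! history output every 144 steps (one day at dt = 600 s)
      NAVG == 144        ! time-averages output interval
      NRST == 1008       ! restart record interval (one week at dt = 600 s)

Each write requires gathering the distributed tiles onto the writing process, so very short intervals can dominate the run time on a slow network.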

jpringle
Posts: 108
Joined: Sun Jul 27, 2003 6:49 pm
Location: UNH, USA

#6 Unread post by jpringle »

Roughly, the efficiency of the parallelization is reduced by the following:

* Slow interconnects.

* Small tiles, so that there is a lot of communication (which scales with the perimeter of the tile) for each bit of internal computation in the tile (which scales with the number of grid points in the tile).


Your setup has a slow interconnect, and a small model domain. You will find that the parallelization efficiency increases as your model domain becomes larger. (And with such a small problem, do you really need a parallel solution?)
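
To put rough numbers on the perimeter-versus-area argument for this grid (a back-of-the-envelope sketch only, ignoring the actual halo width): with 59x61 points split into 2x2 tiles, each tile holds roughly 30 x 31, about 900 interior points, while its boundary is roughly 2 x (30 + 31), about 120 points. So something like 13% of each tile's points (more, if the halo is wider than one point) must be exchanged with neighbours at every exchange, over a slow interconnect, and the exchanges happen many times per baroclinic step (ndtfast = 40). That is why a 4-way decomposition of such a small domain gains little or nothing.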

This is all very simplified, but it is a good rough start to understanding how the efficiency of the model scales with the number of processors.

Cheers,
Jamie
