>>> First, here is the system configuration:
System information: Linux xxx.xxx.xxx.xx.xxx 2.6.18-128.1.6.el5 #1 SMP Tue Mar 24 12:05:57 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
ROMS version: 3.0. Revision: 265
Fortran compiler: gfortran 4.1.2
MPI: MPICH2 1.0.6. MPICH2 Device: ch3:nemesis
>>> With:
Resolution, Grid 01: 0123x0323x015, Parallel Nodes: 12, Tiling: 003x004
>>> Running in parallel gives the following error:
3537 0 11:47:24 5.687041E-03 5.404082E+01 5.404651E+01 4.251759E+09
WRT_RST - wrote re-start fields (Index=1,2) into time record = 0000001
3538 0 11:47:36 NaN NaN NaN NaN
Blowing-up: Saving latest model state into RESTART file
WRT_RST - wrote re-start fields (Index=1,1) into time record = 0000002
>>> On this system configuration, the same application ran for 88 days with a tiling of 2x4 (one processor per node).
>>> On this system configuration, the same application fails after 14 days with a tiling of 2x8 (multiple processors per node).
>>> On this system configuration, the same application fails after 2 days with a tiling of 4x6 (multiple processors per node).
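For reference, the tilings mentioned above can be compared with a small sketch. It assumes that the first two numbers in the "Resolution" line (123 and 323) are the horizontal grid dimensions and that each direction is split as evenly as possible using ceiling division. This is only an approximation for comparison, not ROMS's actual partitioning code, and the name `tile_extents` is made up for this illustration.

```python
from math import ceil


def tile_extents(lm, mm, ntile_i, ntile_j):
    """Return (process count, max tile width in I, max tile width in J).

    Assumes an even split by ceiling division; this is an illustrative
    approximation, not the partitioning ROMS itself performs.
    """
    return ntile_i * ntile_j, ceil(lm / ntile_i), ceil(mm / ntile_j)


# Horizontal dimensions taken from "Resolution, Grid 01: 0123x0323x015".
LM, MM = 123, 323

# Tilings mentioned in this post.
TILINGS = [(3, 4), (2, 4), (2, 8), (4, 6)]

for ni, nj in TILINGS:
    procs, wi, wj = tile_extents(LM, MM, ni, nj)
    print(f"{ni}x{nj}: {procs} processes, tiles up to {wi}x{wj} points")
```

Under these assumptions, 323 is not evenly divisible by 4, 8, or 6, so the tiles are unequal in that direction for every tiling tried. Whether that is related to the blow-up is not known.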
>>> The same application works well on another cluster with the following system configuration:
Linux mxxx.xxx.xxx.xx.xxx 2.6.17-1.2142_FC4smp #1 SMP Tue Jul 11 22:59:20 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
ROMS version: 3.0. Revision: 265
Fortran compiler: ifort 9.0
MPI: mpich-1.2.7p1
I would like to hear any ideas about why this is happening. Many tiling options have been tried without success.
Thanks in advance.