Blowing up - Tiling problem

Report or discuss software problems and other woes

Moderators: arango, robertson

marianambot
Posts: 1
Joined: Tue Nov 24, 2009 2:25 am

#1 Post by marianambot

Hello everyone:

>>> First, here is the system configuration:

System information: Linux xxx.xxx.xxx.xx.xxx 2.6.18-128.1.6.el5 #1 SMP Tue Mar 24 12:05:57 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
ROMS version: 3.0. Revision: 265
Fortran compiler: gfortran 4.1.2
MPI: MPICH2 1.0.6. MPICH2 Device: ch3:nemesis

>>> With:
Resolution, Grid 01: 0123x0323x015, Parallel Nodes: 12, Tiling: 003x004

>>> Running in parallel gives the following error:

3537  0 11:47:24  5.687041E-03  5.404082E+01  5.404651E+01  4.251759E+09
      WRT_RST   - wrote re-start fields (Index=1,2) into time record = 0000001
3538     0 11:47:36           NaN           NaN           NaN           NaN

Blowing-up: Saving latest model state into  RESTART file

     WRT_RST   - wrote re-start fields (Index=1,1) into time record = 0000002

>>> The same application for this system configuration ran for 88 days, with tiling of 2x4 (one processor per node)
>>> The same application for this system configuration fails after 14 days, with tiling of 2x8 (multiple processors per node)
>>> The same application for this system configuration fails after 2 days, with tiling of 4x6 (multiple processors per node)


>>> The same application runs well on another cluster with the following system configuration:
Linux mxxx.xxx.xxx.xx.xxx 2.6.17-1.2142_FC4smp #1 SMP Tue Jul 11 22:59:20 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
ROMS version: 3.0. Revision: 265
Fortran compiler: ifort 9.0
MPI: mpich-1.2.7p1

I would like to hear any ideas about why this is happening. Many tiling options have been tried without success.

Thanks in advance.

kate
Posts: 4091
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: Blowing up - Tiling problem

#2 Post by kate

Well, you know it works with ifort, so it smells like it could be a gfortran bug. Can you compare the output from day one of a gfortran run with day one of an ifort run? Can you try a newer version of gfortran? Can you look at the nature of the differences between day one (or even after just a few timesteps) of a good tiling and a bad tiling, both with gfortran? ncdiff is good for this sort of thing.
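If ncdiff is not at hand, the same kind of comparison can be roughed out in a few lines of Python. This is only a sketch under assumptions: it takes one field from each run as a flat, row-major list of floats (how the values are read from the history or restart files is left out), and the name summarize_diff and the sample numbers are made up for illustration. It reports the largest difference, where it occurs, and which points are NaN only in the suspect run, which can hint at whether the trouble starts near a tile boundary.

```python
import math


def summarize_diff(good, bad, ni):
    """Compare one field from two runs, given as flat row-major lists.

    ni is the number of points in the fastest-varying (i) direction.
    Returns (max_abs_diff, (j, i) of that point or None, new_nan_points).
    """
    if len(good) != len(bad):
        raise ValueError("fields must have the same number of points")
    max_diff, max_loc, new_nans = 0.0, None, []
    for idx, (g, b) in enumerate(zip(good, bad)):
        j, i = divmod(idx, ni)
        if math.isnan(b) and not math.isnan(g):
            new_nans.append((j, i))  # NaN only in the suspect run
            continue
        if math.isnan(g) or math.isnan(b):
            continue  # reference value is NaN: nothing to compare
        diff = abs(g - b)
        if diff > max_diff:
            max_diff, max_loc = diff, (j, i)
    return max_diff, max_loc, new_nans


if __name__ == "__main__":
    # Made-up 2 x 3 field, purely for illustration (not ROMS output).
    good = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    bad = [1.0, 2.0, 3.5, 4.0, float("nan"), 6.0]
    print(summarize_diff(good, bad, ni=3))
```

Running it on the same field after just a few timesteps of a good tiling and a bad tiling should show whether the two runs already differ long before the NaNs appear.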

The guys here use a "module" program to allow multiple versions of compilers to coexist on a system. Our oldest system has five versions of sunstudio, six versions of pgi and four versions of pathscale. As a user, I can try several and sort the ROMS bugs from the compiler bugs more easily. Of course the software guys hate it because they need to have an MPI stack for each, perhaps also a netcdf for each.
