Running ROMS Job with different number of nodes/processors

General scientific issues regarding ROMS

Moderators: arango, robertson

balaji426
Posts: 4
Joined: Wed May 25, 2016 12:14 am
Location: Indian Institute of Tropical Meteorology

Running ROMS Job with different number of nodes/processors

#1 Post by balaji426

Hi,

I am having a problem when running the same ROMS job (same initial conditions, forcing input, grid, and boundary conditions) on different numbers of processors: 64 and 256.
Case 1:
for 64 processors:

#!/bin/bash
#BSUB -J HOOFS # job name
#BSUB -W 120:00 # wall-clock time (hrs:mins)
#BSUB -n 64 # number of tasks in job
#BSUB -R "span[ptile=16]" # run 16 MPI tasks per node
#BSUB -q incois # queue
#BSUB -e hoofs.error.%J # error file name in which %J is replaced by the job ID
#BSUB -o hoofs.output.%J # output file name in which %J is replaced by the job ID
#BSUB -x # Exclusive execution mode. The job is running exclusively on a host

mpirun -np 64 ./oceanM_saltrelax ./ocean_india_job.in

In ocean_india_job.in :
NtileI == 8 ! I-direction partition
NtileJ == 8 ! J-direction partition


Case 2:
for 256 processors:

#!/bin/bash
#BSUB -J HOOFS # job name
#BSUB -W 120:00 # wall-clock time (hrs:mins)
#BSUB -n 256 # number of tasks in job
#BSUB -R "span[ptile=16]" # run 16 MPI tasks per node
#BSUB -q incois # queue
#BSUB -e hoofs.error.%J # error file name in which %J is replaced by the job ID
#BSUB -o hoofs.output.%J # output file name in which %J is replaced by the job ID
#BSUB -x # Exclusive execution mode. The job is running exclusively on a host

mpirun -np 256 ./oceanM_saltrelax ./ocean_india_job.in

In ocean_india_job.in :
NtileI == 16 ! I-direction partition
NtileJ == 16 ! J-direction partition
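
In both cases the tiling matches the task count (8 x 8 = 64 and 16 x 16 = 256), which is what ROMS expects for an MPI run. When switching between cases it is easy to change one and forget the other, so a small pre-flight check can help. The following is only a sketch: the helper names tile_count and check_tiling are hypothetical (not part of ROMS or LSF), and it assumes the input file uses the "NtileI == 8" layout shown above.

```shell
#!/bin/bash
# Sketch: verify that NtileI * NtileJ in a ROMS input file equals the
# number of MPI tasks. Helper names are hypothetical, not part of ROMS.

# Print NtileI * NtileJ as read from the ROMS input file given as $1.
tile_count() {
    local infile=$1 ni nj
    # Matching lines look like: "NtileI == 8 ! I-direction partition"
    ni=$(awk '$1 == "NtileI" && $2 == "==" { print $3; exit }' "$infile")
    nj=$(awk '$1 == "NtileJ" && $2 == "==" { print $3; exit }' "$infile")
    if [ -z "$ni" ] || [ -z "$nj" ]; then
        echo "NtileI or NtileJ not found in $infile" >&2
        return 1
    fi
    echo $(( ni * nj ))
}

# Return 0 when the tiling in file $1 matches the task count $2.
check_tiling() {
    local infile=$1 ntasks=$2 tiles
    tiles=$(tile_count "$infile") || return 1
    if [ "$tiles" -ne "$ntasks" ]; then
        echo "Mismatch: $tiles tiles but $ntasks MPI tasks" >&2
        return 1
    fi
    echo "OK: $tiles tiles for $ntasks MPI tasks"
}
```

In the job script this could be used as: check_tiling ./ocean_india_job.in 64 && mpirun -np 64 ./oceanM_saltrelax ./ocean_india_job.in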

I am running the ROMS model from 2003 initial conditions up to 2008.
There is a large difference between the outputs of these two runs.
After 2 months the SST difference is about +/- 0.5 degrees, but after 2 years it reaches +/- 2 degrees.
Please see the attached plots of the SST difference.

Can anyone explain the reason for this, and how to avoid or reduce this error?
Attachments
SST_Diff_0535.png (48.78 KiB)
SST_Diff_0365.png (51.09 KiB)

kate
Posts: 4091
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: Running ROMS Job with different number of nodes/processo

#2 Post by kate

This sort of thing can lead to hours of entertainment. :twisted: :roll:

I wouldn't try to debug this with a two-year simulation. Rather, run for ten timesteps, saving history output each step. Do this for two different tilings, maybe 1x4 vs 4x1. Run ncdiff on the outputs to see what changes first. Then resort to print statements at the i,j location(s) that appear in the diffs or else run two duelling debuggers. :shock:
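
A rough sketch of how the short-run comparison above might be set up. Everything named here is an assumption for illustration: make_debug_input and diff_history are hypothetical helpers, the debug_*.in and his_*.nc file names are made up, and the script assumes the input file has NTIMES and NHIS lines in the same "KEYWORD == value" layout as NtileI and NtileJ. The only external tool it calls is ncdiff from NCO, with its -O (overwrite) option.

```shell
#!/bin/bash
# Sketch: prepare two 10-step ROMS inputs with different tilings, then
# difference the resulting history files. Helper names are hypothetical.

# Copy ROMS input file $1 to $2 with tiling $3 x $4, NTIMES set to 10
# and NHIS set to 1 (history output every step). Other lines are untouched.
make_debug_input() {
    local src=$1 dst=$2 ni=$3 nj=$4
    awk -v ni="$ni" -v nj="$nj" '
        $1 == "NtileI" && $2 == "==" { $3 = ni }
        $1 == "NtileJ" && $2 == "==" { $3 = nj }
        $1 == "NTIMES" && $2 == "==" { $3 = 10 }
        $1 == "NHIS"   && $2 == "==" { $3 = 1 }
        { print }
    ' "$src" > "$dst"
}

# Write (history file $1) minus (history file $2) to $3 using NCO's ncdiff.
diff_history() {
    local a=$1 b=$2 out=$3
    if ! command -v ncdiff >/dev/null 2>&1; then
        echo "ncdiff (NCO) not found; skipping" >&2
        return 1
    fi
    ncdiff -O "$a" "$b" "$out"
}

# Possible usage (commented out; file names are placeholders):
#   make_debug_input ocean_india_job.in debug_1x4.in 1 4
#   make_debug_input ocean_india_job.in debug_4x1.in 4 1
#   mpirun -np 4 ./oceanM_saltrelax ./debug_1x4.in   # then rename the history file to his_1x4.nc
#   mpirun -np 4 ./oceanM_saltrelax ./debug_4x1.in   # then rename the history file to his_4x1.nc
#   diff_history his_1x4.nc his_4x1.nc his_diff.nc
```

Both debug inputs write to the same history file name, so the output of the first run has to be moved aside before the second run starts. The first time record in his_diff.nc with nonzero values shows where the two tilings start to diverge.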

johnluick

Re: Running ROMS Job with different number of nodes/processo

#3 Post by johnluick

Hopefully/probably Kate's approach will help you solve it. But I'm also wondering what sort of dynamics you are using. Can you post the cppdefs? Also, what are the values of Lm, Mm, and N? And what do you mean by "SST difference"? PS: the 16x16 solution looks to be going mentally unstable or something. Or maybe that's just me.

ckharris
Posts: 40
Joined: Wed Nov 03, 2004 4:37 pm
Location: VIMS

Re: Running ROMS Job with different number of nodes/processo

#4 Post by ckharris

We had a similar problem, and it did lead to months of entertainment as Kate hinted.

We found that our problem was actually the compiler, not the source code. Whether we got these types of differences depended on the optimization level used in compiling: we did not get the error with low-level optimization (O1), but did get it with higher-level optimization (O3).

We got this problem with one compiler and not with others (but I cannot remember off the top of my head which compiler caused it).

What compiler and compiler flags are you using?

Cheers - Courtney
Courtney Harris
Professor
Virginia Institute of Marine Sciences
http://www.vims.edu/about/directory/fac ... ris_ck.php
