Hi,
I am running into a problem with a ROMS job that uses the same initial conditions, forcing input, grid, and boundary conditions, but a different number of MPI tasks: 64 versus 256.
Case 1:
for 64 processors:
#!/bin/bash
#BSUB -J HOOFS # job name
#BSUB -W 120:00 # wall-clock time (hrs:mins)
#BSUB -n 64 # number of tasks in job
#BSUB -R "span[ptile=16]" # run 16 MPI tasks per node
#BSUB -q incois # queue
#BSUB -e hoofs.error.%J # error file name in which %J is replaced by the job ID
#BSUB -o hoofs.output.%J # output file name in which %J is replaced by the job ID
#BSUB -x # Exclusive execution mode. The job is running exclusively on a host
mpirun -np 64 ./oceanM_saltrelax ./ocean_india_job.in
In ocean_india_job.in :
NtileI == 8 ! I-direction partition
NtileJ == 8 ! J-direction partition
Case 2:
for 256 processors:
#!/bin/bash
#BSUB -J HOOFS # job name
#BSUB -W 120:00 # wall-clock time (hrs:mins)
#BSUB -n 256 # number of tasks in job
#BSUB -R "span[ptile=16]" # run 16 MPI tasks per node
#BSUB -q incois # queue
#BSUB -e hoofs.error.%J # error file name in which %J is replaced by the job ID
#BSUB -o hoofs.output.%J # output file name in which %J is replaced by the job ID
#BSUB -x # Exclusive execution mode. The job is running exclusively on a host
mpirun -np 256 ./oceanM_saltrelax ./ocean_india_job.in
In ocean_india_job.in :
NtileI == 16 ! I-direction partition
NtileJ == 16 ! J-direction partition
I am running the ROMS model from 2003 initial conditions up to 2008. There is a large difference between the outputs of these two runs: after 2 months the SST difference is about +/- 0.5 degrees, but after 2 years it is about +/- 2 degrees. Please see the attached SST difference plots.
Can anyone explain the reason for this, and how can I avoid or reduce this error?
Attachments: SST_Diff_0535.png, SST_Diff_0365.png
Re: Running ROMS Job with different number of nodes/processors
This sort of thing can lead to hours of entertainment.
I wouldn't try to debug this with a two-year simulation. Rather, run for ten timesteps, saving history output each step. Do this for two different tilings, maybe 1x4 vs 4x1. Run ncdiff on the outputs to see what changes first. Then resort to print statements at the i,j location(s) that appear in the diffs or else run two duelling debuggers.
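A minimal sketch of that comparison step, assuming the NCO tools and ncview are available; the history file names below (ocean_his_1x4.nc, ocean_his_4x1.nc) are placeholders for whatever your two short test runs produce:
ncdiff ocean_his_1x4.nc ocean_his_4x1.nc his_diff.nc # difference of every output field, step by step
ncview his_diff.nc # browse record by record to find the first step and the i,j points where the two tilings diverge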
Re: Running ROMS Job with different number of nodes/processors
Hopefully/probably Kate's approach will help you solve it. But I'm also wondering what sort of dynamics you are using. Can you post the cppdefs? Also, what are the values of Lm, Mm, and N? And what do you mean by "SST difference"? PS: the 16x16 solution looks to be going unstable or something. Or maybe that's just me.
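For reference, those grid dimensions are set in the same ROMS input file as the tiling; the values below are placeholders for illustration only, not taken from this setup:
In ocean_india_job.in :
Lm == 480 ! I-direction INTERIOR RHO-points
Mm == 240 ! J-direction INTERIOR RHO-points
N == 40 ! number of vertical levels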
Re: Running ROMS Job with different number of nodes/processors
We had a similar problem, and it did lead to months of entertainment as Kate hinted.
We found that our problem was actually the compiler, not the source code. Whether we got these kinds of differences depended on the optimization level used when compiling: we did not get the error with a low optimization level (O1), but we did get it with a higher optimization level (O3).
We got this problem with one compiler and not with others (but I cannot remember off the top of my head which compiler caused it).
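As a rough illustration, assuming an Intel Fortran (ifort) build, the change amounts to switching the optimization flags in the ROMS compiler configuration (e.g. Compilers/Linux-ifort.mk in a typical source tree); other compilers have analogous flags:
FFLAGS := -O3 # aggressive optimization; floating-point reordering can make results differ between tilings
FFLAGS := -O1 -fp-model precise # lower optimization with value-safe arithmetic, useful for reproducibility tests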
What compiler and compiler flags are you using?
Cheers - Courtney
Courtney Harris
Professor
Virginia Institute of Marine Sciences
http://www.vims.edu/about/directory/fac ... ris_ck.php