I think there is a parallel bug in ROMS floats, specifically when carrying out a perfect restart.
I took the ROMS floats 3D test problem and modified it slightly:
(a) increased Nx, Ny, and Nz to make the problem somewhat larger in terms of computational effort,
(b) modified the time step so that the run remains stable,
(c) modified the ROMS/Include/flt_test.h file to add the #define SOLVE3D, #define PERFECT_RESTART, and #define RST_SINGLE CPP options, and
(d) ran the simulation from day 0 to day 0.8, then did a perfect restart and ran it from day 0.8 to day 1.6. The first ensemble of 240 particles was injected into the simulation on day 0.4 and the second on day 0.9 (after the perfect restart).
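For concreteness, the header change in step (c) amounts to something like the following (a sketch only; the rest of the existing options in ROMS/Include/flt_test.h are left as they are):

```c
/* Additions to ROMS/Include/flt_test.h for this test */
#define SOLVE3D          /* fully 3D primitive-equation test case */
#define PERFECT_RESTART  /* exact (bit-for-bit) restart capability */
#define RST_SINGLE       /* write restart file in single precision */
```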
I am attempting to debug a very large (~667 x 662 x 50) real-life simulation, so I added the PERFECT_RESTART CPP option because we use it for that application; and since that application is so large, the restart file becomes too big to write out in double precision, so we also need the RST_SINGLE CPP option to control its size. The SOLVE3D CPP option was added to make the floats test case fully 3D.
Please find attached the two toms.in files (one for the cold start/initial run and one for the restart), the two floats.in files (one with FRREC = 0 and one with FRREC = -1 for the restart), and the modified flt_test.h file from ROMS/Include/, so that others can attempt to reproduce my simulations and confirm the existence of a bug in floats.
I first ran the test case in serial mode and the results were OK. When I run it in parallel, however, upon doing a restart the particles injected before the restart have the expected X, Y, etc. values, but the new particles (injected upon/after the restart) all have X = 0, Y = 0 - i.e., they are all zeros! The results are illustrated in the attached plot, where I track the X-coordinate in the ocean_flt.nc file: it shows the total number of zero-valued X-coordinates, the number of fill-valued (1 x 10^34, plotted as NaN) X-coordinates, and the number of X-coordinates with finite values (0 < X < 10000, say). The sum of these three counts should always be 480 at every time, because that is the total number of particles I inject into the simulation.
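To make the bookkeeping in the plot explicit, here is a minimal sketch of the classification I apply to the X-coordinates at each time level (Python/NumPy; the 1 x 10^34 fill threshold and the 0 < X < 10000 window are the values described above, and reading the actual ocean_flt.nc variable is left out):

```python
import numpy as np

FILL = 1.0e34  # ROMS-style fill value for inactive/unbounded floats


def classify_x(x, xmax=10000.0):
    """Count zero, fill/NaN, and finite X-coordinates at one time level.

    Returns (n_zero, n_fill, n_finite); for 480 injected particles the
    three counts should always sum to 480.
    """
    x = np.asarray(x, dtype=float)
    n_zero = int(np.sum(x == 0.0))                          # bogus restart particles
    n_fill = int(np.sum(np.isnan(x) | (np.abs(x) >= FILL)))  # not-yet-released/fill
    n_finite = int(np.sum((x > 0.0) & (x < xmax)))          # physically plausible
    return n_zero, n_fill, n_finite
```

Applying this to every record of the X-coordinate array and plotting the three counts against time produces the curves in the attached figure; any drift of the sum away from 480 (or a jump in the zero count at the restart time) flags the bug.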
I am using ROMS SVN version 562 and running the simulations on an IBM Power 6 cluster, both in serial and with MPI (i.e., MPI = on in the makefile).
Please let me know what you think and advise on how to fix this parallel bug.
Thank you.
Parallel bug in ROMS floats
- Attachments:
  - flt_test.h (1.13 KiB)
  - floats_flt_rst.in (9.35 KiB)
  - floats_flt_beg.in (9.34 KiB)
  - toms_flt_rst.in (83.73 KiB)
  - toms_flt_beg.in (83.73 KiB)