I am re-posting here part of my reply to a reported parallel bug in the ROMS floats. This is important, and it has more visibility in this thread. Notice that nobody can reply here; this is done to keep important messages visible to everybody and easy to find.
I spent several hours today trying to track down the reported parallel problem in the floats, which I cannot reproduce.
Like Mark Hadfield, I ran the FLT_TEST case, but in a 3D configuration with 1x1, 1x2, and 2x2 tile partitions in distributed-memory. Mark was not able to reproduce the problem either. I changed NTIMES=140 (default 135) so I can restart at time-step 70 (NRST=70), and I activated PERFECT_RESTART and OUT_DOUBLE. Therefore, I ran the following 6 experiments (a sketch of how such runs could be driven follows the list):
Code: Select all
Exp 1: 1x1 no restart
Exp 2: 1x1 restart at time-step 70
Exp 3: 1x2 no restart
Exp 4: 1x2 restart at time-step 70
Exp 5: 2x2 no restart
Exp 6: 2x2 restart at time-step 70
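For what it is worth, here is a minimal sketch of how one might drive the three tile partitions from a script. This is not part of ROMS: the executable name (oceanM), the input-script name (ocean_flt_test.in), and the directory layout are assumptions that you would adapt to your own setup. PERFECT_RESTART and OUT_DOUBLE are CPP options activated at compile time in the application header, not in this script.
Code: Select all
#!/usr/bin/env python
# Hedged sketch: drive the three FLT_TEST tile partitions, keeping each
# run in its own directory.  Executable name, input-script name, and
# directory layout are assumptions; adapt to your own build.
import os, re, subprocess

partitions = [(1, 1), (1, 2), (2, 2)]           # NtileI x NtileJ

def set_param(text, key, value):
    # Replace "key == old" with "key == value" in the ROMS input script.
    return re.sub(r'(%s\s*==\s*)\S+' % key, r'\g<1>%s' % value, text)

for ni, nj in partitions:
    run_dir = 'run_%dx%d' % (ni, nj)
    os.makedirs(run_dir, exist_ok=True)
    text = open('ocean_flt_test.in').read()     # assumed input-script name
    text = set_param(text, 'NtileI', ni)
    text = set_param(text, 'NtileJ', nj)
    text = set_param(text, 'NTIMES', 140)
    text = set_param(text, 'NRST',   70)
    infile = os.path.join(run_dir, 'ocean.in')
    open(infile, 'w').write(text)
    # Distributed-memory run; MPI processes = NtileI * NtileJ.
    # The restarted experiments (2, 4, 6) need a second run that reads the
    # restart file (ININAME and NRREC in the input script), not shown here.
    subprocess.run(['mpirun', '-np', str(ni * nj), './oceanM', infile],
                   check=True)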
If I difference the output NetCDF files for Exp 1, Exp 3, and Exp 5, they are all identical (byte-by-byte). This implies that we don't have parallel bugs for this configuration (physics and Lagrangian trajectories). The same result is obtained when I compare Exp 2, Exp 4, and Exp 6. Also, if I compare Exp 1 and Exp 2, only the history file is identical. The restart file is not identical because the records are swapped during the perfect restart. However, if you read the data in Matlab and compare the appropriate record for each state variable, they are identical (zero difference for each variable in the NetCDF file). We just have a different record arrangement for the fields along the unlimited dimension. The same record-by-record check can be scripted, as sketched below.
Now, the float trajectories are not identical byte-by-byte because we don't have a perfect restart for the floats. Recall that the Lagrangian trajectories use a fourth-order Milne time-stepping scheme. It is tricky and unnecessary to provide a perfect restart for the floats, as you can see in the animations below. Each animation shows the trajectories of 240 particles in each experiment. As you can see, the solutions are identical to the eye. However, if you pay attention, there are a few differences that are due to the different initialization during restart, but we can live with such differences. The Lagrangian trajectories are highly nonlinear and chaotic: any small perturbation may result in a different trajectory, which is the case for a couple of particles. A perfect restart for the floats would require us to save four different time levels for the positions, trajectories, right-hand-side terms, and property fields. Also, the random walk complicates matters. A toy illustration of why restarting a multi-step scheme without its full history perturbs the trajectory is sketched below.
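To illustrate the point, here is a toy 1-D sketch (not the ROMS floats code): a fourth-order Milne predictor needs the right-hand side at several previous time levels, so restarting from only the latest position re-initializes that history and slightly perturbs the subsequent trajectory. The right-hand side, step size, and start-up procedure below are assumptions chosen only for illustration.
Code: Select all
# Toy sketch: 4th-order Milne predictor,
#   y(n+1) = y(n-3) + (4h/3) * ( 2 f(n) - f(n-1) + 2 f(n-2) ),
# needs three previous right-hand sides.  Restarting from the latest
# position only (history rebuilt with Euler sub-steps) gives a slightly
# different answer than the uninterrupted run.
import numpy as np

f = lambda t, y: np.cos(t) * y                  # simple test right-hand side
h, nsteps = 0.05, 140

def startup(t0, y0):
    # Build the first four time levels with small Euler sub-steps
    # (a stand-in for whatever start-up procedure is actually used).
    ys, fs = [y0], [f(t0, y0)]
    for k in range(3):
        sub = ys[-1]
        for m in range(20):
            sub += (h / 20.0) * f(t0 + k * h + m * h / 20.0, sub)
        ys.append(sub)
        fs.append(f(t0 + (k + 1) * h, sub))
    return ys, fs

def milne(t0, y0, n_steps):
    ys, fs = startup(t0, y0)
    for n in range(3, n_steps):
        t = t0 + n * h
        ynew = ys[n - 3] + (4.0 * h / 3.0) * (2.0 * fs[n] - fs[n - 1]
                                              + 2.0 * fs[n - 2])
        ys.append(ynew)
        fs.append(f(t + h, ynew))
    return ys

full = milne(0.0, 1.0, nsteps)                  # uninterrupted run
half = milne(0.0, 1.0, 70)                      # run up to the restart step
rest = milne(70 * h, half[70], nsteps - 70)     # restart: history rebuilt
print('difference at final step:', abs(full[nsteps] - rest[nsteps - 70]))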
If I compare the float NetCDF files for Exp 2, Exp 4, and Exp 6, they are also identical byte-by-byte. This is because the initialization during restart is identical regardless of the tile partition. This indicates that we don't have parallel bugs in the restarting of the floats.
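A byte-by-byte check like this can be done with cmp or, as a sketch, in Python (the file names are assumptions):
Code: Select all
# Sketch: byte-by-byte comparison of the float NetCDF files from the
# restarted experiments.  File names are assumptions for illustration.
import filecmp

files = ['ocean_flt_exp2.nc', 'ocean_flt_exp4.nc', 'ocean_flt_exp6.nc']
ref = files[0]
for other in files[1:]:
    same = filecmp.cmp(ref, other, shallow=False)   # compare contents
    print('%s vs %s : %s' % (ref, other, 'identical' if same else 'DIFFER'))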
I didn't have to change the dimensions of the test problem (Lm=14, Mm=12); there is no need for such a thing. You can alter the configuration of this specific test, but you need to look carefully at how it is configured inside ROMS. The floats code is very modular and generic for any application. It is very unlikely that the distributed code has a systematic parallel design bug that shows up only in a particular application. Parallel bugs can easily be introduced when users modify their version of the code; we are not responsible for such bugs.
Exp 1, 1x1 no restart:
Exp 2, 1x1 restart at time-step 70:
Exp 3, 1x2 no restart:
Exp 4, 1x2 restart at time-step 70:
Exp 5, 2x2 no restart:
Exp 6, 2x2 restart at time-step 70: