I am re-posting here part of my reply to a reported parallel bug in the ROMS floats. This is important, and it has more visibility in this thread. Notice that nobody can reply here; this is done to keep important messages visible to everybody and easy to find.
I spent several hours today trying to track down the reported parallel problem in the floats, which I cannot reproduce.
Like Mark Hadfield, I ran the FLT_TEST case, but in a 3D configuration with 1x1, 1x2, and 2x2 tile partitions in distributed-memory. Mark was not able to reproduce the problem either. I changed NTIMES=140 (default 135) so I can restart at time-step 70 (NRST=70), and I activated PERFECT_RESTART and OUT_DOUBLE. Therefore, I ran the following 6 experiments (a sketch of how such runs could be driven follows the list):
Code: Select all
Exp 1: 1x1 no restart
Exp 2: 1x1 restart at time-step 70
Exp 3: 1x2 no restart
Exp 4: 1x2 restart at time-step 70
Exp 5: 2x2 no restart
Exp 6: 2x2 restart at time-step 70
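For what it is worth, here is a minimal sketch of how one might drive the three tile partitions from a script. This is not part of ROMS: the executable name (oceanM), the input-script name (ocean_flt_test.in), and the directory layout are assumptions that you would adapt to your own setup. PERFECT_RESTART and OUT_DOUBLE are CPP options activated at compile time in the application header, not in this script.
Code: Select all
#!/usr/bin/env python
# Hedged sketch: drive the three FLT_TEST tile partitions, keeping each
# run in its own directory.  Executable name, input-script name, and
# directory layout are assumptions; adapt to your own build.
import os, re, subprocess

partitions = [(1, 1), (1, 2), (2, 2)]           # NtileI x NtileJ

def set_param(text, key, value):
    # Replace "key == old" with "key == value" in the ROMS input script.
    return re.sub(r'(%s\s*==\s*)\S+' % key, r'\g<1>%s' % value, text)

for ni, nj in partitions:
    run_dir = 'run_%dx%d' % (ni, nj)
    os.makedirs(run_dir, exist_ok=True)
    text = open('ocean_flt_test.in').read()     # assumed input-script name
    text = set_param(text, 'NtileI', ni)
    text = set_param(text, 'NtileJ', nj)
    text = set_param(text, 'NTIMES', 140)
    text = set_param(text, 'NRST',   70)
    infile = os.path.join(run_dir, 'ocean.in')
    open(infile, 'w').write(text)
    # Distributed-memory run; MPI processes = NtileI * NtileJ.
    # The restarted experiments (2, 4, 6) need a second run that reads the
    # restart file (ININAME and NRREC in the input script), not shown here.
    subprocess.run(['mpirun', '-np', str(ni * nj), './oceanM', infile],
                   check=True)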
If I difference the output NetCDF files for Exp 1, Exp 3, and Exp 5, they are all identical (byte-by-byte). This implies that we don't have parallel bugs for this configuration (physics and Lagrangian trajectories). The same result is obtained when I compare Exp 2, Exp 4, and Exp 6. Also, if I compare Exp 1 and Exp 2, only the history file is identical. The restart file is not identical because the records are swapped during the perfect restart. However, if you read the data in Matlab and compare the appropriate record for each state variable, they are identical (zero difference for each variable in the NetCDF file). We just have a different record arrangement for the fields along the unlimited dimension. The same record-by-record check can be scripted, as sketched below.
Now, the float trajectories are not identical byte-by-byte because we don't have a perfect restart for the floats. Recall that the Lagrangian trajectories use a fourth-order Milne time-stepping scheme. It is tricky and unnecessary to provide a perfect restart for the floats, as you can see in the animations below. Each animation shows the trajectories of 240 particles in each experiment. As you can see, the solutions are identical to the eye. However, if you pay attention, there are a few differences that are due to the different initialization during restart, but we can live with such differences. The Lagrangian trajectories are highly nonlinear and chaotic: any small perturbation may result in a different trajectory, which is the case for a couple of particles. A perfect restart for the floats would require us to save four different time levels for the positions, trajectories, right-hand-side terms, and property fields. Also, the random walk complicates matters. A toy illustration of why restarting a multi-step scheme without its full history perturbs the trajectory is sketched below.
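To illustrate the point, here is a toy 1-D sketch (not the ROMS floats code): a fourth-order Milne predictor needs the right-hand side at several previous time levels, so restarting from only the latest position re-initializes that history and slightly perturbs the subsequent trajectory. The right-hand side, step size, and start-up procedure below are assumptions chosen only for illustration.
Code: Select all
# Toy sketch: 4th-order Milne predictor,
#   y(n+1) = y(n-3) + (4h/3) * ( 2 f(n) - f(n-1) + 2 f(n-2) ),
# needs three previous right-hand sides.  Restarting from the latest
# position only (history rebuilt with Euler sub-steps) gives a slightly
# different answer than the uninterrupted run.
import numpy as np

f = lambda t, y: np.cos(t) * y                  # simple test right-hand side
h, nsteps = 0.05, 140

def startup(t0, y0):
    # Build the first four time levels with small Euler sub-steps
    # (a stand-in for whatever start-up procedure is actually used).
    ys, fs = [y0], [f(t0, y0)]
    for k in range(3):
        sub = ys[-1]
        for m in range(20):
            sub += (h / 20.0) * f(t0 + k * h + m * h / 20.0, sub)
        ys.append(sub)
        fs.append(f(t0 + (k + 1) * h, sub))
    return ys, fs

def milne(t0, y0, n_steps):
    ys, fs = startup(t0, y0)
    for n in range(3, n_steps):
        t = t0 + n * h
        ynew = ys[n - 3] + (4.0 * h / 3.0) * (2.0 * fs[n] - fs[n - 1]
                                              + 2.0 * fs[n - 2])
        ys.append(ynew)
        fs.append(f(t + h, ynew))
    return ys

full = milne(0.0, 1.0, nsteps)                  # uninterrupted run
half = milne(0.0, 1.0, 70)                      # run up to the restart step
rest = milne(70 * h, half[70], nsteps - 70)     # restart: history rebuilt
print('difference at final step:', abs(full[nsteps] - rest[nsteps - 70]))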
If I compare the float NetCDF files for Exp 2, Exp 4, and Exp 6, they are also identical byte-by-byte. This is because the initialization during restart is identical regardless of the tile partition. This indicates that we don't have parallel bugs in the restarting of the floats.
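A byte-by-byte check like this can be done with cmp or, as a sketch, in Python (the file names are assumptions):
Code: Select all
# Sketch: byte-by-byte comparison of the float NetCDF files from the
# restarted experiments.  File names are assumptions for illustration.
import filecmp

files = ['ocean_flt_exp2.nc', 'ocean_flt_exp4.nc', 'ocean_flt_exp6.nc']
ref = files[0]
for other in files[1:]:
    same = filecmp.cmp(ref, other, shallow=False)   # compare contents
    print('%s vs %s : %s' % (ref, other, 'identical' if same else 'DIFFER'))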
I didn't have to change the dimensions of the test problem (Lm=14, Mm=12); there is no need for such a thing. You can alter the configuration of this specific test, but you need to look carefully at how it is configured inside ROMS. The floats code is very modular and generic for any application. It is very unlikely that the distributed code has a systematic parallel design bug that shows up only in a particular application. Parallel bugs can easily be introduced when users modify their version of the code; we are not responsible for such bugs.
Exp 1, 1x1 no restart:
Exp 2, 1x1 restart at time-step 70:
Exp 3, 1x2 no restart:
Exp 4, 1x2 restart at time-step 70:
Exp 5, 2x2 no restart:
Exp 6, 2x2 restart at time-step 70: