ROMS restart and floats

lanerolle · Tue Sep 06, 2011 11:06 pm

If I am continuing a ROMS simulation using a perfect or imperfect restart and it has floats (in both the previous and current simulation), does the restart simulation:

(a) get the float locations from the previous simulation, or
(b) get the float locations from the floats.in file (specified in toms.in)?

If the latter, then I guess there is some manual work involved: we will need to read in the ocean_flt.nc file, get the floats' lon, lat, depth values and write out a fresh floats.in file to be used with each restart run.

kate · Wed Sep 07, 2011 12:01 am

You can set it either way, based on FRREC in the floats.in file. It's annoying that you have to edit floats.in on restart for just this, but so it is.

m.hadfield · Wed Sep 07, 2011 12:46 am

As I understand it, if you choose to read float locations from the restart file, then you don't read the floats.in file, which means you can't release any more floats. In some cases, this is a significant restriction.

Edit: Sorry, I didn't read the original post very carefully and I seem to have pointed out something that was implicit there. Yes, I guess you could hack the floats input file on every restart so that it included the current location of the already-released floats plus the source info for the not-yet-released ones. But a solution that reads both the restart file and the floats-input file and merges the info from the two would be better. Unless this is what ROMS already does, and I haven't noticed it, in which case I withdraw and unreservedly apologise.

kate · Wed Sep 07, 2011 4:04 pm

ROMS always reads the floats.in file, which contains the switch for reading the floats.nc file or not. The floats.in file also contains the float release times, some of which could still be in the future. What you can't do on restart is change the number of floats if you are reading the floats.nc file.

lanerolle · Wed Sep 07, 2011 4:21 pm

Thanks for all your informative replies. Let me explain a little bit more about what I am trying to achieve. I am attempting to set up a series of N-day runs (I need to run in segments because the supercomputer only allows 4-hour time slots at a time) where subsequent runs are restarted (perfectly or imperfectly) from the corresponding previous runs.

Now, although this is a series of runs, I only want to use a single floats.in file for ALL of the runs (or run segments). This is because this single floats.in file releases a new set of particles every day and it covers the full duration of all of the run segments.

So if I put FRREC=-1 in the floats.in file, in the first run it will attempt to find floats.nc and as there is no such file, it will crash. So for the first run, I need to put FRREC=0. But what about the subsequent runs? Can I put FRREC=-1 and then they will restart from the previous floats.nc file(s) and will ROMS also read the floats.in file and add new particles to the simulation?

kate · Wed Sep 07, 2011 4:37 pm

Yes, I believe that's how it's supposed to work. The only way to know for 100% sure is to try it, though. Let us know if something else happens.

m.hadfield · Thu Sep 08, 2011 12:02 am

kate wrote: Yes, I believe that's how it's supposed to work. The only way to know for 100% sure is to try it, though.

So it does. I found a toy run I set up several years ago to confirm that ROMS does handle float locations and float releases in the second and subsequent runs of a series seamlessly. But I had obviously forgotten that when I contributed to the thread above.

arango · Thu Sep 08, 2011 4:42 pm

The input script floats.in is a template and as such it can be edited automatically with a job submission script to change the value of FRREC to the desired value. This is trivial and we do it extensively in the adjoint-based algorithms. For example, in the script ROMS/Bin/job_i4dvar.sh you will find several modifications of the template script:

# Set string manipulations perl script.

 set SUBSTITUTE=${ROMS_ROOT}/ROMS/Bin/substitute

...

# Modify 4D-Var template input script and specify above files.
 
 set I4DVAR=i4dvar.in
 if (-e $I4DVAR) then
   /bin/rm $I4DVAR
 endif
 cp s4dvar.in $I4DVAR

 $SUBSTITUTE $I4DVAR ocean_std_i.nc $STDnameI
 $SUBSTITUTE $I4DVAR ocean_std_b.nc $STDnameB
 $SUBSTITUTE $I4DVAR ocean_std_f.nc $STDnameF
 $SUBSTITUTE $I4DVAR ocean_nrm_i.nc $NRMnameI
 $SUBSTITUTE $I4DVAR ocean_nrm_b.nc $NRMnameB
 $SUBSTITUTE $I4DVAR ocean_nrm_f.nc $NRMnameF
 $SUBSTITUTE $I4DVAR ocean_obs.nc $OBSname
 $SUBSTITUTE $I4DVAR ocean_hss.nc wc13_hss.nc
 $SUBSTITUTE $I4DVAR ocean_lcz.nc wc13_lcz.nc
 $SUBSTITUTE $I4DVAR ocean_mod.nc wc13_mod.nc
 $SUBSTITUTE $I4DVAR ocean_err.nc wc13_err.nc

where substitute is a generic perl script distributed in ROMS/Bin. A similar strategy can be used operationally with floats.in and other input scripts. In our data assimilation cycles, we usually have elaborate scripts to run everything for time windows of days, months, years...
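As a rough illustration of the same idea applied to floats.in, here is a minimal Python sketch (not the ROMS `substitute` script) that sets FRREC in a template for each run segment. It assumes the template contains exactly one line of the form `FRREC == <integer>`; the function names `set_frrec` and `frrec_for_segment` and the template excerpt are hypothetical, made up for this example. The choice of 0 for the first segment and -1 afterwards follows the usage discussed earlier in this thread.

```python
import re


def set_frrec(template_text: str, frrec: int) -> str:
    """Return template_text with its single FRREC assignment set to frrec.

    Assumes a line such as 'FRREC == 0' (one or two '=' signs), possibly
    followed by a '!' comment. Raises if there is not exactly one such line.
    """
    pattern = re.compile(r"^([ \t]*FRREC[ \t]*={1,2}[ \t]*)-?\d+", re.MULTILINE)
    new_text, count = pattern.subn(lambda m: m.group(1) + str(frrec), template_text)
    if count != 1:
        raise ValueError(f"expected exactly one FRREC line, found {count}")
    return new_text


def frrec_for_segment(segment: int) -> int:
    """Segment 1 starts fresh (0); later segments read the last record (-1)."""
    return 0 if segment == 1 else -1


# Hypothetical two-line excerpt standing in for a real floats.in template.
template = (
    "! hypothetical excerpt of a floats.in template\n"
    "      FRREC == 0                 ! restart switch\n"
)

for seg in (1, 2, 3):
    edited = set_frrec(template, frrec_for_segment(seg))
    print(seg, edited.splitlines()[1].strip())
```

A job script could call something like this once per segment and write the result to the floats.in that the run actually reads, leaving the template untouched.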

We have done that for forecast systems running remotely aboard ships. We could do this on computers on the space station if we were allowed... This is the way that our space exploration robots work. They are usually operated by computers with scripts written by engineers.

lanerolle · Thu Sep 08, 2011 6:16 pm

I tried the suggestions of Mark and Kate. Unfortunately, for me, the floats are not working correctly or I have done something wrong!

To summarize, I am doing a simulation beginning on day 68.0 relative to a base date of 2011/01/01 0000 hrs GMT. In my floats.in file, I inject an ensemble of particles once a day beginning on day 11 - which corresponds to day 79.0 in the ROMS clock. I have run the code for 16 days - from day 68.0 to day 84.0 in 4 separate segments (as I only get 4 hours on the supercomputer and I can only do 4-days of simulations in this time slot because it is a huge application ~670 x 620 x 50 points) and hence I do 3 restarts. So the particles first come in to the simulation in segment 3 (day 79.0).
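The timing described above can be checked with a few lines of arithmetic. This is only a sketch using the numbers from this post (start on day 68.0, four 4-day segments, first release 11 days after the start); the helper names `segment_windows` and `first_release_segment` are hypothetical.

```python
def segment_windows(start_day, segment_days, n_segments):
    """Return (begin, end) model days for each consecutive run segment."""
    return [
        (start_day + i * segment_days, start_day + (i + 1) * segment_days)
        for i in range(n_segments)
    ]


def first_release_segment(windows, release_day):
    """Return the 1-based segment whose window [begin, end) holds release_day."""
    for number, (begin, end) in enumerate(windows, start=1):
        if begin <= release_day < end:
            return number
    return None


windows = segment_windows(start_day=68.0, segment_days=4.0, n_segments=4)
release_day = 68.0 + 11.0  # release time counted from the start of the run

print(windows)
print(first_release_segment(windows, release_day))  # segment 3
```

With these numbers the first release (day 79.0) falls in the third segment, which runs from day 76.0 to day 80.0, so the floats first appear after two restarts.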

In the first segment, I set FRREC = 0 in the floats.in file and for the subsequent segments, I use the same floats.in file but with FRREC = -1 (i.e. use the latest positions from ocean_flt.nc to initialize the particle tracks).

When I look at the values of (lon, lat, depth) in the ocean_flt.nc file, I find that for the first run segment, these variables have a value such as 1.0e+34; this is fine because the particles have not come into the computation as yet. Unfortunately, I also find that for the next three run segments, the values of these variables are 0.0 and this is the case for all of the particles and all of the times. Hence, ROMS generates an ocean_flt.nc file containing zeros and does not appear to read the positions (and use them) from the floats.in file.
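A quick way to summarize this kind of output is to count how many position values look like fill values, how many are exactly zero, and how many look like real positions. The sketch below works on plain Python lists (reading the NetCDF file is left out), and the threshold of 1.0e30 for "looks like a fill value" is an assumption chosen only because the values seen here are around 1.0e+34. The function name `classify_positions` is hypothetical.

```python
def classify_positions(values, fill_threshold=1.0e30):
    """Count values that look unreleased (huge), exactly zero, or valid.

    fill_threshold is an assumed cutoff, not a ROMS constant.
    """
    counts = {"unreleased": 0, "zero": 0, "valid": 0}
    for value in values:
        if abs(value) >= fill_threshold:
            counts["unreleased"] += 1
        elif value == 0.0:
            counts["zero"] += 1
        else:
            counts["valid"] += 1
    return counts


# Made-up samples mimicking the two situations described in this post.
first_segment_lon = [1.0e34, 1.0e34, 1.0e34]
restart_segment_lon = [0.0, 0.0, 0.0]

print(classify_positions(first_segment_lon))
print(classify_positions(restart_segment_lon))
```

A zero can of course be a legitimate longitude, latitude or depth in some domains, so a count of zeros is only suspicious when every float at every time has it.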

I am using ROMS version 562 which I believe is quite recent.

Any thoughts on why this is happening?

m.hadfield · Thu Sep 08, 2011 9:27 pm

One possibility: release times in the floats-input file need to be relative to DSTART.

lanerolle · Thu Sep 08, 2011 9:33 pm

Really? I thought the times in the floats.in file (in units of days) have to be relative to the initialization time which is 0.0!

m.hadfield · Thu Sep 08, 2011 9:54 pm

lanerolle wrote: Really? I thought the times in the floats.in file (in units of days) have to be relative to the initialization time which is 0.0!

That's what I expected, too. But experience did not confirm my expectation.

When all else fails, read the source.

lanerolle · Fri Sep 09, 2011 7:54 pm

I just checked and all of the times in my floats.in file(s) are relative to DSTART (in toms.in).

I find that when FRREC == 0, particles are correctly injected in to the ROMS simulations. However, when I use FRREC == -1, the particle info written in to the same ocean_flt.nc file (e.g. lon, lat, depth, etc.) are all zeros. So it is as if floats.in is not read at all when I use FRREC == -1.

Mark, could you please post a segment from your two floats.in files, the one with FRREC == 0 and the other with FRREC == -1? I will check your file format and parameters against mine to see whether I can get a clue about why my floats are not working when restarting.

Also, are you doing anything special during restarts? - I am using the PERFECT_RESTART and RST_SINGLE CPP options.

m.hadfield · Sun Sep 11, 2011 7:32 pm

Attached are input files for a two-segment float simulation using the built-in FLT_TEST (SOLVE3D off). The segments are 0.8 days long and floats are released (successfully) at 0.1 days and 0.9 days.

lanerolle · Tue Sep 13, 2011 11:40 pm

Thanks for posting the toms.in and floats.in files.

I took the 3D Floats ROMS test case, modified it a little and did some debugging runs. The serial run produces the expected results but the parallel run has problems upon carrying out a restart (a perfect restart).

Please see my posting under "ROMS Bugs". I am posting all of my toms.in and floats.in and flt_test.h files - could you please attempt to reproduce my run(s) in serial and parallel and confirm whether there is a parallel bug?

I did my MPI and serial runs on an IBM Power 6 cluster. I am using ROMS SVN version 562.

m.hadfield · Wed Sep 14, 2011 1:36 am

I re-ran the two-segment run (for which I posted the files a couple of messages back) with MPI enabled and it worked correctly, i.e. the floats released at 0.1 days made it through the restart at 0.8 days and were joined by those from the next release at 0.9 days. That's with SOLVE3D turned off.

PS: it's not a perfect restart.

arango · Wed Sep 14, 2011 2:37 am

I have extensively tested the floats in serial and parallel (shared- and distributed-memory) in several test cases and realistic applications. I always get identical solutions and identical floats NetCDF files, byte-by-byte. This is the way that I frequently check for parallel bugs in ROMS. As a matter of fact, I always turn on AVERAGES, DIAGNOSTICS_TS, DIAGNOSTICS_UV, FLOATS, and STATIONS for several of the distributed test problems with ROMS. I run them with 1x1, 2x2, and 3x3 partitions in serial, shared-memory, and distributed-memory. Notice that I always run serial with partitions. I perform binary differences between average, diagnostics, floats, history, restart, and station NetCDF files for each partition to make sure that the solutions are identical. The output files used in these comparisons are always in double-precision (OUT_DOUBLE) to avoid round-off. In major releases, I make nine runs with each application using the newest version of the code and I compare against a reference version that I know was fully tested. This guarantees that the distributed code in the repository is free of parallel bugs.
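The binary-difference check described above can be sketched with the Python standard library. This is only an illustration, not the procedure actually used for ROMS releases: the function name `compare_outputs` is hypothetical, and the file names and contents in the demo are placeholders written to temporary directories.

```python
import filecmp
import os
import tempfile


def compare_outputs(reference_dir, candidate_dir, names):
    """Return {name: True/False}: True when the two files match byte for byte."""
    return {
        name: filecmp.cmp(
            os.path.join(reference_dir, name),
            os.path.join(candidate_dir, name),
            shallow=False,  # compare contents, not just size and timestamps
        )
        for name in names
    }


# Demo with placeholder files: same history bytes, different floats bytes.
with tempfile.TemporaryDirectory() as ref_dir, tempfile.TemporaryDirectory() as new_dir:
    for directory, flt_bytes in ((ref_dir, b"floats-A"), (new_dir, b"floats-B")):
        with open(os.path.join(directory, "ocean_his.nc"), "wb") as handle:
            handle.write(b"history")
        with open(os.path.join(directory, "ocean_flt.nc"), "wb") as handle:
            handle.write(flt_bytes)
    result = compare_outputs(ref_dir, new_dir, ["ocean_his.nc", "ocean_flt.nc"])

print(result)
```

In practice one directory would hold the output of, say, a 1x1 run and the other the output of a 2x2 run of the same application.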

Parallel bugs can be easily introduced by customization of the svn distributed code. I have done that myself several times in the past.

We always need to test every change made to the code inside a parallel region in the numerical kernel. The only parallel bug that I am aware of is the one for TS_MPDATA. This option has a serial-with-partitions and shared-memory bug for applications with north-south periodic boundary conditions. I have been hunting for this bug for more than a year. It is a hard one to find. I had mentioned this in the past in this forum. Notice that this bug is not in the distributed-memory version of the code.

I recall a shared-memory bug in the floats when random walk is activated. I reported and fixed this problem several months ago. See trac ticket 474. This bug was due to the random number generator. It is very difficult to generate an identical sequence of random numbers in shared-memory.

I don't see any relationship between the perfect restart and the restarting of the floats. This is done separately. I haven't tested restarting the floats in a long time since I haven't modified that logic. I will have to reproduce this problem, if any, and check it in the debugger. I will look at this sometime this week. I am busy finishing the documentation for the new release.

kate · Wed Sep 14, 2011 2:59 pm

I see others have been quite helpful here. My suggestion to Lyon would be to invest in a debugger and learn to use it since you have the time and energy for tracking this stuff down. Also, the first rule of debugging is to find the simplest case to exhibit the trouble. Is adding PERFECT_RESTART necessary to see the trouble? You didn't say. How about SOLVE3D? Ditto. Why are you making the problem larger without stating that it depends on size? If you expect one of us to run this on our system, make it as small as possible.

P.S. I wasn't planning on running this case at any size. If you want this level of support, perhaps I could set up a consulting business.

lanerolle · Thu Sep 15, 2011 1:15 am

When I run the floats with my modified test problem without the PERFECT_RESTART and RST_SINGLE CPP options, the serial and parallel (MPI) outputs are the same. However, when I include these two CPP options, the parallel ROMS output has zero X, Y values upon carrying out a restart and the problems begin...

I need to use these two CPP options because I am attempting to debug the floats for a much larger, realistic ROMS application which takes a long time to run (1-day of simulation = 1-hour of CPU time using 256 processors) and that application employs these options.

m.hadfield · Thu Sep 15, 2011 1:51 am

lanerolle wrote: When I run the floats with my modified test problem without the PERFECT_RESTART and RST_SINGLE CPP options, the serial and parallel (MPI) outputs are the same. However, when I include these two CPP options, the parallel ROMS output has zero X, Y values upon carrying out a restart and the problems begin...

I need to use these two CPP options because I am attempting to debug the floats for a much larger, realistic ROMS application which takes a long time to run (1-day of simulation = 1-hour of CPU time using 256 processors) and that application employs these options.

Weird. But now you have a reproducible bug, you're 50% of the way to solving it. The other 50% will require some work.

The first thing I suggest is to reduce the scale of your problem back down to the toy level and see if you still get the same behaviour. If you do, then get to work on the small case. If not, then you'll have to debug the large case, which is likely to be slower and more awkward.

By the way, array bounds checking is always your friend (except when you use it on a large problem, where it might introduce new problems by increasing the memory footprint, which is part of the reason for making your test problems small).

And clearly, seeing which combinations of PERFECT_RESTART and RST_SINGLE CPP cause problems is a high priority.

[Edit] Another thought: I have seen odd floats behaviour in parallel applications in the past, where floats "fall through the cracks" between tiles and so do not belong to any processor. And another thought: I have never used either PERFECT_RESTART or RST_SINGLE CPP, but how can a restart from single-precision stored data be perfect? If this is a foolish or irrelevant question, please ignore it.

lanerolle · Thu Sep 15, 2011 2:42 am

Many thanks for your thoughts and advice. The reason I use both PERFECT_RESTART and RST_SINGLE is because I am trying to emulate a very large, real-life ROMS application. In that application, we have to continue the ROMS simulations by breaking them into segments (as we only get 4 hours on the supercomputer at a time) and the PERFECT_RESTART option provides the ideal solution: the run continues as if it's a single, long run. However, when we use this CPP option, the restart files get large, and for my application the file is too large to be written out (it exceeds the 2.17 GB limit; I need to work on this issue too by trying out different NetCDF libraries for compilation!) and so we use RST_SINGLE to bring down the size of the restart file. Of course this means that our restarts are not perfect, but we can attempt to get as close to perfection as possible.

wilkin · Fri Sep 16, 2011 2:58 am

We need to add lines in globaldefs...

#ifdef RST_SINGLE
#undef PERFECT_RESTART
#endif

You can't have both.

arango · Fri Sep 16, 2011 3:12 am

Well, I really don't understand single precision restart file RST_SINGLE and perfect restart. All ROMS computations are in double precision. If the restart file is single precision, you no longer have a perfect restart between simulations because of round-off.

I have tested this in the past for basically all the important CPP options. I make a full run of the application, say 21 days. Then, I make another run restarting every 7 days. At the end, I compare the two solutions and all the output NetCDF files are identical (byte-by-byte) when I perform a binary difference. This is only achieved if the output restart file is in double-precision.

If you use a single precision restart file, the round-off starts growing in time and the solutions differ after several simulation cycles. It is like adding a random perturbation to each initialization cycle. The round-off is random in some applications.
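The loss of information from storing a double-precision value in single precision can be seen in a few lines of standard-library Python. This is a generic illustration of round-off, not ROMS code; the helper name `through_single` is hypothetical.

```python
import struct


def through_single(x: float) -> float:
    """Store x as a 4-byte IEEE float and read it back as a Python double."""
    return struct.unpack("f", struct.pack("f", x))[0]


value = 0.1                      # not exactly representable in binary
restored = through_single(value)

print(restored == value)         # the round trip changed the value
print(abs(restored - value))     # size of the perturbation

exact = 0.5                      # a power of two survives the round trip
print(through_single(exact) == exact)
```

Every state variable written to a single-precision restart file picks up a perturbation of this kind, which is why a restart from such a file cannot reproduce the uninterrupted run bit for bit.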

m.hadfield · Fri Sep 16, 2011 3:50 am

lanerolle wrote:The reason I use both PERFECT_RESTART and RST_SINGLE is because I am trying to emulate a very large, real-life ROMS application. I that application, we have to continue the ROMS simulations by breaking them in to segments (as we only get 4-hours of time on the supercomputer at a time) and the PERFECT_RESTART option provides the ideal solution - the run continues as if its a single, long run. However, when we use this CPP option, the restart files get large and for my application, it is too large to be written out (exceeds the 2.17 GB limit - need to work on this issue on ROMS too by trying out different NetCDF libraries for compilation!!!) and so we use RST_SINGLE to bring down the size of the restart file. Of couse this means that our restarts are not perfect but we can attempt to get close to perfection as possible.

I do multi-segment runs all the time and have never used PERFECT_RESTART (I tried once but it didn't work for some reason or other, probably now fixed). I have never noticed a problem caused by ordinary restarts. Actually I can think of one: the GLS mixing scheme goes crazy for a time step or two with an ordinary restart.

As John and Hernan have pointed out, using imperfect perfect restarts is a little quirky.

arango · Fri Sep 16, 2011 8:04 pm

I spent several hours today trying to find a problem that I cannot reproduce.

Like Mark, I ran the FLT_TEST but in a 3D configuration with 1x1, 1x2, and 2x2 tile partitions in distributed-memory. Mark was not able to reproduce the problem either. I set NTIMES=140 (default 135) so that I could restart at time-step 70 (NRST=70). I activated PERFECT_RESTART and OUT_DOUBLE. Therefore, I ran 6 experiments:

 Exp 1:    1x1   no restart
 Exp 2:    1x1   restart at time-step 70
 Exp 3:    1x2   no restart
 Exp 4:    1x2   restart at time-step 70
 Exp 5:    2x2   no restart
 Exp 6:    2x2   restart at time-step 70

If I difference the output NetCDF files for Exp 1, Exp 3, and Exp 5, they are all identical (byte-by-byte). This implies that we don't have parallel bugs for this configuration (physics and Lagrangian trajectories). The same result is obtained when I compare Exp 2, Exp 4, and Exp 6.

Also, if I compare Exp 1 and Exp 2, only the history file is identical. The restart file is not identical because the records are swapped during the perfect restart. However, if you read the data in Matlab and compare the appropriate record for each state variable, they are identical (zero difference for each variable in the NetCDF file). We just have a different record arrangement for the fields in the unlimited dimension.

Now, the float trajectories are not identical byte-by-byte because we don't have a perfect restart for the floats. Recall that the Lagrangian trajectories use a fourth-order Milne time-stepping scheme. It is tricky and unnecessary to provide a perfect restart for floats, as you can see in the animations below. Each animation shows the trajectories of 240 particles in each experiment. As you can see, the solutions are identical to the eye. However, if you pay attention there are a few differences that are due to the different initialization during restart, but we can live with such differences. The Lagrangian trajectories are highly nonlinear and chaotic. Any small perturbation may result in a different trajectory, which is the case for a couple of particles. A perfect restart for floats would require us to save four different time levels for the positions, trajectories, right-hand-side, and property fields. Also, the random walk complicates matters.
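The restart-file comparison described above, where the records hold the same data but in a different order, can be sketched by pairing records by their time stamp instead of by their position. The sketch uses plain Python lists in place of NetCDF variables; the function name `records_match` and the sample numbers are hypothetical.

```python
def records_match(times_a, fields_a, times_b, fields_b):
    """True when both files hold the same field for every time stamp.

    Records are paired by time value, so a different ordering along the
    unlimited dimension does not count as a difference.
    """
    if sorted(times_a) != sorted(times_b):
        return False
    field_b_at = dict(zip(times_b, fields_b))
    return all(field == field_b_at[time] for time, field in zip(times_a, fields_a))


# Made-up two-record example: same data, records stored in swapped order.
times_run1 = [100.0, 200.0]
field_run1 = [[1.0, 2.0], [3.0, 4.0]]
times_run2 = [200.0, 100.0]
field_run2 = [[3.0, 4.0], [1.0, 2.0]]

print(field_run1 == field_run2)   # position-by-position comparison differs
print(records_match(times_run1, field_run1, times_run2, field_run2))
```

A byte-by-byte difference of two such files would report a mismatch even though every variable agrees exactly once the records are lined up.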

If I compare the floats NetCDF files for Exp 2, Exp 4, and Exp 6, they are also identical byte-by-byte. This is because the initialization during restart is identical regardless of the tile partition. This indicates that we don't have parallel bugs in the restarting of the floats.

I didn't have to change the dimension of the test problem (Lm=14, Mm=12). There is no need for such a thing. You can alter the configuration of this specific test. You need to look carefully at how this test is configured inside ROMS. The floats are very modular and generic for any application. It is very unlikely to have a systematic design parallel bug for a particular application in the distributed code. Parallel bugs can be easily introduced when users modify their version of the code. We are not responsible for such bugs.

[Animations, one per experiment: Exp 1, 1x1 no restart; Exp 2, 1x1 restart at time-step 70; Exp 3, 1x2 no restart; Exp 4, 1x2 restart at time-step 70; Exp 5, 2x2 no restart; Exp 6, 2x2 restart at time-step 70.]

I highly recommend that you look at your application in detail. Perhaps you need to learn how to use an advanced graphical debugger. I have been using TotalView for more than 15 years. Also, think twice before posting such alarming messages about possible parallel bugs in the code. We try as best as we can to release our complex modeling system without parallel bugs. Several of us spent a few hours trying to figure out what is wrong and cannot reproduce your problem. This indicates to us that you may be doing something wrong.

If you happen to discover and report a real bug in the future, we will be hesitant to invest time to see if we can reproduce it. We are all really busy and have very limited time to look at the problems that every user has in ROMS. I always try my best to check such bug reports. ROMS has thousands of users and I don't have the time to check all the possible bug reports from each user. We have a very elaborate system (svn and trac) for updating and documenting updates and bug fixes to the code. I always provide very detailed information. Every user receives the e-mails when I make changes to the repository. I know that many users tend to ignore such important updates or bug corrections; it is out of my control.

arango · Wed Sep 11, 2013 9:20 pm

We discovered a problem with the floats restart. See the following ticket for more information. Many thanks to Diego Narvaez for providing excellent clues.
