I'm facing a similar problem here, although I'm not using perfect restart but a normal restart (many times) from the *rst.nc files. I found that ROMS BLOWS UP WHEN RESTARTED MANY TIMES in a row!!! And it does not blow up when run continuously!
Because of the restricted run-time allocation at our supercomputing facility, I planned to run my ROMS simulations in small chunks, restarting the simulation many times in a cycle. That way, if my allocation time runs out, I can restart from my latest saved set of files. So I tested my submission script and found that ROMS does not like restarts very much. ROMS BLOWS UP WHEN RESTARTED MANY TIMES!!! Why is this happening???
As in the comments above in this thread, I'm using WET_DRY (needed because of the huge tidal range in my forcing), but I'm restarting from the ordinary RST files (not from perfect restart files). For this test, each chunk of the simulation is 360 time steps, covering ~1 hour in total since my time step is ~10 s (DT=10.003168946726626d0). I restart the simulation many times, trying to cover a total simulation time of 24 hours (24 restarts).
But with this cycle of restarts, the simulation blows up. It runs fine for 8 restarts but blows up at the 9th. Here is the output close to the blow-up:
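For anyone checking the chunk sizes quoted above, here is a minimal Python sketch of the arithmetic. The constants are taken from this post; the helper name `chunk_step_range` is made up for illustration and is not part of ROMS.

```python
# Chunk arithmetic for the restart test, using the values quoted in the post.
DT = 10.003168946726626   # time step in seconds (from the post)
STEPS_PER_CHUNK = 360     # time steps per restart chunk (from the post)
N_CHUNKS = 24             # planned number of chunks (from the post)

chunk_seconds = DT * STEPS_PER_CHUNK
total_hours = chunk_seconds * N_CHUNKS / 3600.0


def chunk_step_range(first_step, chunk_index, steps=STEPS_PER_CHUNK):
    """Hypothetical helper: inclusive (first, last) step of a 0-based chunk."""
    start = first_step + chunk_index * steps
    return start, start + steps - 1


print(f"one chunk covers {chunk_seconds:.2f} s")
print(f"{N_CHUNKS} chunks cover {total_hours:.3f} h")
print("steps in a chunk starting at 89281:", chunk_step_range(89281, 0))
```

A chunk starting at step 89281 ends at step 89640, which is the same range the failing run reports in its "started time-stepping" line.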
NL ROMS/TOMS: started time-stepping: (Grid: 01 TimeSteps: 00089281 - 00089640)
STEP Day HH:MM:SS KINETIC_ENRG POTEN_ENRG TOTAL_ENRG NET_VOLUME
C => (i,j,k) Cu Cv Cw Max Speed
89280 282 08:04:42 2.479003E-02 2.410173E+02 2.410421E+02 7.148928E+11
(048,063,35) 6.320714E-03 5.322611E-02 0.000000E+00 3.661505E+00
89281 282 08:04:52 2.552012E-02 2.446391E+02 2.446647E+02 7.044243E+11
(104,096,40) 1.590393E-02 1.424100E-02 1.607916E+01 3.657581E+00
89282 282 08:05:02 2.558147E-02 2.446433E+02 2.446689E+02 7.044514E+11
(104,096,40) 1.456755E-02 1.269888E-02 1.670602E+01 3.403521E+00
89283 282 08:05:12 2.568717E-02 2.446475E+02 2.446732E+02 7.044785E+11
(104,096,40) 1.321542E-02 1.074954E-02 1.536749E+01 3.359425E+00
89284 282 08:05:22 2.581073E-02 2.446517E+02 2.446775E+02 7.045057E+11
(104,096,40) 1.249286E-02 9.867474E-03 1.357870E+01 3.363641E+00
89285 282 08:05:32 2.593870E-02 2.446560E+02 2.446819E+02 7.045330E+11
(104,094,40) 1.006534E-02 1.366681E-03 1.317981E+01 3.370461E+00
89286 282 08:05:42 2.606966E-02 2.446602E+02 2.446862E+02 7.045604E+11
(104,096,40) 1.365864E-02 1.061730E-02 1.246750E+01 3.380622E+00
89287 282 08:05:52 2.620533E-02 2.446644E+02 2.446906E+02 7.045879E+11
(104,096,40) 1.514322E-02 1.154090E-02 1.387893E+01 3.390705E+00
89288 282 08:06:02 2.634384E-02 2.446687E+02 2.446950E+02 7.046154E+11
(104,096,40) 1.615597E-02 1.085659E-02 1.507718E+01 3.400685E+00
89289 282 08:06:12 2.648552E-02 2.446729E+02 2.446994E+02 7.046430E+11
(104,096,40) 1.762328E-02 1.227998E-02 1.555833E+01 3.410534E+00
89290 282 08:06:22 2.662899E-02 2.446772E+02 2.447038E+02 7.046707E+11
(104,096,40) 1.879293E-02 1.243754E-02 1.690179E+01 3.419896E+00
89291 282 08:06:32 2.677449E-02 2.446815E+02 2.447083E+02 7.046985E+11
(104,096,40) 2.014827E-02 1.384759E-02 1.836597E+01 3.428964E+00
89292 282 08:06:42 2.692095E-02 2.446858E+02 2.447127E+02 7.047264E+11
(104,096,40) 2.098085E-02 1.419100E-02 1.982157E+01 3.437324E+00
89293 282 08:06:52 2.706869E-02 2.446901E+02 2.447172E+02 7.047544E+11
(104,096,40) 2.135433E-02 1.511572E-02 2.133009E+01 3.444007E+00
89294 282 08:07:02 2.721721E-02 2.446944E+02 2.447216E+02 7.047824E+11
(104,096,40) 2.097109E-02 1.159407E-02 2.162402E+01 5.839296E+00
89295 282 08:07:12 2.740157E-02 2.446988E+02 2.447262E+02 7.048105E+11
(104,096,40) 1.986952E-02 5.070485E-02 1.939949E+01 4.035932E+01
Blowing-up: Saving latest model state into RESTART file
WRT_RST - wrote re-start fields (Index=2,2) into time record = 0000001
Elapsed CPU time (seconds):
Node # 0 CPU: 10.527
Node # 3 CPU: 10.581
Node # 1 CPU: 10.588
Node # 2 CPU: 10.586
Node # 5 CPU: 10.585
Node # 4 CPU: 10.586
ROMS/TOMS - Output NetCDF summary for Grid 01:
number of time records written in RESTART file = 00000001
Analytical header files used:
ROMS/Functionals/ana_btflux.h
/scratch/partner658/espinosa/ROMS/670_my_own_testing/02_SS_JustM2_MassiveParticles_HydroTest1/Functionals/ana_fsobc.h
/scratch/partner658/espinosa/ROMS/670_my_own_testing/02_SS_JustM2_MassiveParticles_HydroTest1/Functionals/ana_m2obc.h
ROMS/Functionals/ana_nudgcoef.h
ROMS/Functionals/ana_smflux.h
ROMS/Functionals/ana_stflux.h
/scratch/partner658/espinosa/ROMS/670_my_own_testing/02_SS_JustM2_MassiveParticles_HydroTest1/Functionals/ana_tobc.h
ROMS/TOMS: DONE... Tuesday - September 23, 2014 - 12:38:47 AM
But when I run the simulation continuously, it runs without any problem!!
As you can see from the output for the same time steps:
...
...
...
...
...
89280 282 08:04:42 2.670702E-02 2.446185E+02 2.446452E+02 7.042765E+11
(104,096,40) 1.498553E-02 1.352495E-02 1.397800E+01 3.641511E+00
WRT_HIS - wrote history fields (Index=1,1) into time record = 0000006
WRT_RST - wrote re-start fields (Index=1,1) into time record = 0000002
89281 282 08:04:52 2.685490E-02 2.446227E+02 2.446496E+02 7.043038E+11
(104,096,40) 1.547758E-02 1.397703E-02 1.492934E+01 3.637908E+00
89282 282 08:05:02 2.700380E-02 2.446269E+02 2.446539E+02 7.043312E+11
(104,096,40) 1.575886E-02 1.421823E-02 1.567068E+01 3.634307E+00
89283 282 08:05:12 2.715340E-02 2.446311E+02 2.446583E+02 7.043587E+11
(104,096,40) 1.588478E-02 1.431352E-02 1.618525E+01 3.630708E+00
89284 282 08:05:22 2.730364E-02 2.446353E+02 2.446626E+02 7.043863E+11
(104,096,40) 1.590744E-02 1.431219E-02 1.654407E+01 3.627112E+00
89285 282 08:05:32 2.745461E-02 2.446396E+02 2.446670E+02 7.044140E+11
(104,096,40) 1.590422E-02 1.426623E-02 1.679340E+01 3.623517E+00
89286 282 08:05:42 2.760602E-02 2.446438E+02 2.446714E+02 7.044417E+11
(104,096,40) 1.587656E-02 1.417374E-02 1.701210E+01 3.619922E+00
89287 282 08:05:52 2.775840E-02 2.446481E+02 2.446758E+02 7.044695E+11
(104,096,40) 1.583423E-02 1.405226E-02 1.717540E+01 3.616327E+00
89288 282 08:06:02 2.791151E-02 2.446523E+02 2.446802E+02 7.044975E+11
(104,096,40) 1.578613E-02 1.395136E-02 1.730568E+01 3.612730E+00
89289 282 08:06:12 2.806538E-02 2.446566E+02 2.446847E+02 7.045255E+11
(104,096,40) 1.575341E-02 1.388557E-02 1.743891E+01 3.609130E+00
89290 282 08:06:22 2.822002E-02 2.446609E+02 2.446891E+02 7.045535E+11
(104,096,40) 1.575442E-02 1.384692E-02 1.760374E+01 3.605526E+00
89291 282 08:06:32 2.837534E-02 2.446652E+02 2.446936E+02 7.045817E+11
(104,096,40) 1.572365E-02 1.381274E-02 1.782701E+01 3.601916E+00
89292 282 08:06:42 2.853143E-02 2.446695E+02 2.446980E+02 7.046099E+11
(104,096,40) 1.567685E-02 1.378948E-02 1.802741E+01 3.598298E+00
89293 282 08:06:52 2.868806E-02 2.446738E+02 2.447025E+02 7.046382E+11
(104,096,40) 1.561686E-02 1.376412E-02 1.823463E+01 3.594670E+00
89294 282 08:07:02 2.884530E-02 2.446781E+02 2.447070E+02 7.046667E+11
(104,096,40) 1.557221E-02 1.374078E-02 1.843543E+01 3.591031E+00
89295 282 08:07:12 2.900351E-02 2.446825E+02 2.447115E+02 7.046951E+11
(104,096,40) 1.553922E-02 1.369591E-02 1.866165E+01 3.587378E+00
89296 282 08:07:22 2.916243E-02 2.446868E+02 2.447160E+02 7.047237E+11
(104,096,40) 1.547416E-02 1.362648E-02 1.889349E+01 3.583709E+00
89297 282 08:07:32 2.932207E-02 2.446912E+02 2.447205E+02 7.047524E+11
(104,096,40) 1.540072E-02 1.355317E-02 1.907962E+01 3.580023E+00
89298 282 08:07:42 2.948203E-02 2.446955E+02 2.447250E+02 7.047811E+11
(104,096,40) 1.532771E-02 1.347997E-02 1.925973E+01 3.576317E+00
89299 282 08:07:52 2.964292E-02 2.446999E+02 2.447296E+02 7.048099E+11
(104,096,40) 1.218503E-02 1.016348E-02 1.937232E+01 3.572591E+00
89300 282 08:08:02 2.980456E-02 2.447043E+02 2.447341E+02 7.048388E+11
(104,096,40) 1.277930E-02 1.083514E-02 1.319938E+01 3.568841E+00
...
...
...
...
...
...
95380 283 01:01:42 5.400885E-02 2.588579E+02 2.589119E+02 7.892415E+11
(048,064,01) 0.000000E+00 3.799400E-02 5.501194E-01 5.228957E+00
95381 283 01:01:52 5.378631E-02 2.588627E+02 2.589165E+02 7.892663E+11
(048,064,01) 0.000000E+00 3.798532E-02 5.499430E-01 5.227709E+00
95382 283 01:02:02 5.356422E-02 2.588676E+02 2.589212E+02 7.892909E+11
(048,064,01) 0.000000E+00 3.797675E-02 5.497681E-01 5.226459E+00
95383 283 01:02:12 5.334257E-02 2.588724E+02 2.589258E+02 7.893155E+11
(048,064,01) 0.000000E+00 3.796829E-02 5.495952E-01 5.225211E+00
95384 283 01:02:22 5.312137E-02 2.588773E+02 2.589304E+02 7.893399E+11
(048,064,01) 0.000000E+00 3.795997E-02 5.494243E-01 5.223966E+00
95385 283 01:02:32 5.290061E-02 2.588821E+02 2.589350E+02 7.893643E+11
(048,064,01) 0.000000E+00 3.795180E-02 5.492557E-01 5.222726E+00
95386 283 01:02:42 5.268031E-02 2.588869E+02 2.589395E+02 7.893886E+11
(048,064,01) 0.000000E+00 3.794378E-02 5.490897E-01 5.221494E+00
95387 283 01:02:52 5.246047E-02 2.588916E+02 2.589441E+02 7.894128E+11
(048,064,01) 0.000000E+00 3.793593E-02 5.489263E-01 5.220271E+00
95388 283 01:03:02 5.224107E-02 2.588964E+02 2.589486E+02 7.894368E+11
(048,064,01) 0.000000E+00 3.792825E-02 5.487656E-01 5.219059E+00
95389 283 01:03:12 5.202214E-02 2.589011E+02 2.589531E+02 7.894608E+11
(048,064,01) 0.000000E+00 3.792073E-02 5.486076E-01 5.217860E+00
95390 283 01:03:22 5.180366E-02 2.589058E+02 2.589576E+02 7.894847E+11
(048,064,01) 0.000000E+00 3.791339E-02 5.484524E-01 5.216674E+00
95391 283 01:03:32 5.158564E-02 2.589105E+02 2.589621E+02 7.895086E+11
(048,064,01) 0.000000E+00 3.790622E-02 5.482999E-01 5.215504E+00
95392 283 01:03:42 5.136808E-02 2.589152E+02 2.589666E+02 7.895323E+11
(048,064,01) 0.000000E+00 3.789921E-02 5.481500E-01 5.214350E+00
95393 283 01:03:52 5.115099E-02 2.589199E+02 2.589710E+02 7.895559E+11
(048,064,01) 0.000000E+00 3.789236E-02 5.480026E-01 5.213212E+00
95394 283 01:04:02 5.093436E-02 2.589245E+02 2.589755E+02 7.895794E+11
(048,064,01) 0.000000E+00 3.788566E-02 5.478576E-01 5.212092E+00
95395 283 01:04:12 5.071820E-02 2.589292E+02 2.589799E+02 7.896029E+11
(048,064,01) 0.000000E+00 3.787911E-02 5.477149E-01 5.210988E+00
95396 283 01:04:22 5.050250E-02 2.589338E+02 2.589843E+02 7.896262E+11
(048,064,01) 0.000000E+00 3.787269E-02 5.475742E-01 5.209902E+00
95397 283 01:04:32 5.028728E-02 2.589384E+02 2.589887E+02 7.896495E+11
(048,064,01) 0.000000E+00 3.786639E-02 5.474355E-01 5.208833E+00
95398 283 01:04:42 5.007253E-02 2.589430E+02 2.589930E+02 7.896726E+11
(048,064,01) 0.000000E+00 3.786021E-02 5.472985E-01 5.207779E+00
95399 283 01:04:52 4.985825E-02 2.589475E+02 2.589974E+02 7.896957E+11
(048,064,01) 0.000000E+00 3.785412E-02 5.471630E-01 5.206741E+00
95400 283 01:05:02 4.964444E-02 2.589521E+02 2.590017E+02 7.897187E+11
(048,064,01) 0.000000E+00 3.784813E-02 5.470288E-01 5.205717E+00
WRT_HIS - wrote history fields (Index=1,1) into time record = 0000006
WRT_RST - wrote re-start fields (Index=1,1) into time record = 0000001
Elapsed CPU time (seconds):
Node # 0 CPU: 11946.184
Node # 17 CPU: 12230.091
Node # 18 CPU: 12222.168
Node # 19 CPU: 12216.437
Node # 5 CPU: 12192.117
Node # 6 CPU: 12188.503
Node # 7 CPU: 12174.581
Node # 9 CPU: 12181.620
Node # 20 CPU: 12213.519
Node # 4 CPU: 12182.725
Node # 1 CPU: 12176.599
Node # 2 CPU: 12177.346
Node # 3 CPU: 12180.974
Node # 11 CPU: 12194.746
Node # 8 CPU: 12181.300
Node # 10 CPU: 12189.401
Node # 21 CPU: 12212.763
Node # 22 CPU: 12217.321
Node # 23 CPU: 12225.852
Node # 12 CPU: 12227.724
Node # 13 CPU: 12224.131
Node # 14 CPU: 12212.782
Node # 15 CPU: 12217.012
Node # 16 CPU: 12220.612
Node # 43 CPU: 12221.638
Node # 44 CPU: 12216.134
Node # 45 CPU: 12217.062
Node # 46 CPU: 12220.611
Node # 36 CPU: 12223.850
Node # 47 CPU: 12233.105
Node # 37 CPU: 12218.730
Node # 38 CPU: 12222.902
Node # 39 CPU: 12221.564
Node # 40 CPU: 12215.394
Node # 41 CPU: 12228.349
Node # 42 CPU: 12227.992
Node # 27 CPU: 12218.353
Node # 28 CPU: 12222.798
Node # 29 CPU: 12226.374
Node # 30 CPU: 12230.767
Node # 31 CPU: 12223.091
Node # 32 CPU: 12218.498
Node # 33 CPU: 12218.167
Node # 34 CPU: 12214.387
Node # 35 CPU: 12228.949
Node # 24 CPU: 12223.831
Node # 25 CPU: 12216.456
Node # 26 CPU: 12212.922
ROMS/TOMS - Output NetCDF summary for Grid 01:
number of time records written in HISTORY file = 00000006
number of time records written in RESTART file = 00000002
Analytical header files used:
ROMS/Functionals/ana_btflux.h
/scratch/partner658/espinosa/ROMS/670_my_own_testing/02_SS_JustM2_MassiveParticles_HydroTest1/Functionals/ana_fsobc.h
/scratch/partner658/espinosa/ROMS/670_my_own_testing/02_SS_JustM2_MassiveParticles_HydroTest1/Functionals/ana_initial.h
/scratch/partner658/espinosa/ROMS/670_my_own_testing/02_SS_JustM2_MassiveParticles_HydroTest1/Functionals/ana_m2obc.h
ROMS/Functionals/ana_nudgcoef.h
ROMS/Functionals/ana_smflux.h
ROMS/Functionals/ana_stflux.h
/scratch/partner658/espinosa/ROMS/670_my_own_testing/02_SS_JustM2_MassiveParticles_HydroTest1/Functionals/ana_tobc.h
ROMS/TOMS: DONE... Tuesday - September 23, 2014 - 12:30:34 AM
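To make the two logs above easier to compare, here is a rough Python sketch that pulls out the diagnostics lines (STEP, Day, time, energies, net volume) and reports the relative difference in kinetic energy at the steps both runs share. The regular expression and the function names are my own assumptions based on the log layout shown in this post, not anything provided by ROMS, and the sample strings are just the first lines copied from each log.

```python
import re

# One diagnostics line, as laid out in the logs above:
# STEP Day HH:MM:SS KINETIC_ENRG POTEN_ENRG TOTAL_ENRG NET_VOLUME
DIAG = re.compile(
    r"^\s*(\d+)\s+\d+\s+\d{2}:\d{2}:\d{2}"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s*$"
)


def parse_diagnostics(text):
    """Map step number -> (kinetic, potential, total, net_volume)."""
    out = {}
    for line in text.splitlines():
        m = DIAG.match(line)
        if not m:
            continue  # skips the "(i,j,k) Cu Cv Cw ..." lines and other output
        try:
            out[int(m.group(1))] = tuple(float(g) for g in m.groups()[1:])
        except ValueError:
            continue  # a field was not numeric, so not a diagnostics line
    return out


def relative_differences(restarted, continuous):
    """Yield (step, relative kinetic-energy difference) for shared steps."""
    for step in sorted(restarted.keys() & continuous.keys()):
        ke_r, ke_c = restarted[step][0], continuous[step][0]
        yield step, (ke_r - ke_c) / ke_c


# First lines of each log, copied from the post.
restart_log = """\
89280 282 08:04:42 2.479003E-02 2.410173E+02 2.410421E+02 7.148928E+11
(048,063,35) 6.320714E-03 5.322611E-02 0.000000E+00 3.661505E+00
89281 282 08:04:52 2.552012E-02 2.446391E+02 2.446647E+02 7.044243E+11
"""
continuous_log = """\
89280 282 08:04:42 2.670702E-02 2.446185E+02 2.446452E+02 7.042765E+11
(104,096,40) 1.498553E-02 1.352495E-02 1.397800E+01 3.641511E+00
89281 282 08:04:52 2.685490E-02 2.446227E+02 2.446496E+02 7.043038E+11
"""

diffs = list(relative_differences(parse_diagnostics(restart_log),
                                  parse_diagnostics(continuous_log)))
for step, rel in diffs:
    print(f"step {step}: kinetic energy differs by {rel:+.2%}")
```

With the full log files read from disk instead of the sample strings, the same functions would show at which step the restarted run starts to drift away from the continuous one. In the lines quoted here the kinetic energy already differs by about 7% at step 89280.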
Any clue as to why these restart problems are happening? I can share my simulation files if you want to reproduce the problem on your side. This happens with version 3.6 (build 670) and with version 3.7 (build 737) as well.
Many thanks,
Alexis Espinosa