cannot restart from rst file

General scientific issues regarding ROMS

Moderators: arango, robertson

nacholibre
Posts: 81
Joined: Thu Dec 07, 2006 3:14 pm
Location: USGS

#1 Post by nacholibre

I am using MPI with 20 nodes, and when I try to restart from the rst.nc file I get an error related to "Free control buffers" from each node. The model runs for 15 minutes or so before blowing up, but it writes neither a history file nor any time steps to the output file. I have attached the output file, and the error message from the system is shown below. I have been trying to solve this for the last couple of days, but I have not done a restart under MPI before, so any comments would be very helpful.
I have set NRREC=-1 to start from the final time step in the rst.nc file. I cannot restart from the his file because I did not write all of the variables to it. Thinking it might be a memory issue, I tried up to 4 GB of memory on each node, but it did not help. "ulimit -a" displays:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
pending signals                 (-i) 1024
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 40960
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

oceanM: posixio.c:412: px_get: Assertion `extent != 0' failed.

oceanM:7418 terminated with signal 6 at PC=31c8f2e21d SP=7fbfffe488.  Backtrace:
/lib64/tls/libc.so.6(gsignal+0x3d)[0x31c8f2e21d]
/lib64/tls/libc.so.6(abort+0xfe)[0x31c8f2fa1e]
/lib64/tls/libc.so.6(__assert_fail+0xf1)[0x31c8f27ae1]
/storage01/home/gth834q/projects/ga31/code/./oceanM[0x7362f0]
MPIRUN: both MPI progress and Ping Quiescence Detected.
MPIRUN: 1 out of 20 ranks showed no MPI send or receive progress in 900 seconds.
MPIRUN: 1 out of 20 ranks could not return a ping request.
MPIRUN: Per-rank details are the following:
MPIRUN: Rank    0 (node010         ) caused both MPI progress and Ping Quiescence.
node010:2.Quiescence detected. Message Queues:
node010:3.Quiescence detected. Message Queues:
node009:6.Quiescence detected. Message Queues:
node006:17.Quiescence detected. Message Queues:
node006:18.Quiescence detected. Message Queues:
node007:12.Quiescence detected. Message Queues:
node008:8.Quiescence detected. Message Queues:
node008:11.Quiescence detected. Message Queues:
node010:1.Quiescence detected. Message Queues:
node008:9.Quiescence detected. Message Queues:
node009:4.Quiescence detected. Message Queues:
node007:14.Quiescence detected. Message Queues:
node009:7.Quiescence detected. Message Queues:
node009:5.Quiescence detected. Message Queues:
node006:16.Quiescence detected. Message Queues:
node007:15.Quiescence detected. Message Queues:
node006:19.Quiescence detected. Message Queues:
node008:10.Quiescence detected. Message Queues:
node007:13.Quiescence detected. Message Queues:
Thanks!
Zafer

PS.
I have also tried it in serial mode. There I get this error:

oceanS: posixio.c:412: px_get: Assertion `extent != 0' failed.
instead of the old one:

node010:2.SHM Free control buffers: 2048

Attachment: roms_out.txt (24.32 KiB)
Last edited by nacholibre on Fri Jul 10, 2009 6:26 pm, edited 2 times in total.

jcwarner
Posts: 1200
Joined: Wed Dec 31, 2003 6:16 pm
Location: USGS, USA

Re: cannot restart from rst file

#2 Post by jcwarner

Is that the log output from the restart?
What did the log output from the first run look like?
Try it without stations.

nacholibre
Posts: 81
Joined: Thu Dec 07, 2006 3:14 pm
Location: USGS

Re: cannot restart from rst file

#3 Post by nacholibre

There were several reasons for this:
The main problem was that the initialization file was missing. I had overlooked it because ROMS, strangely, did not throw an error message for that.
Also, it was not possible to restart from the restart file: its final record is the last step saved before the blowup, so any restart from it blows up as well. It is therefore necessary to start from the history file, which in turn requires writing the variables needed for a restart to the history file.
Zafer
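
Since the missing initialization file went unnoticed here, a small pre-flight check before submitting the job may help. The following is only a sketch in Python, not part of ROMS: the helper name `find_missing_inputs` and the choice of keys (`GRDNAME`, `ININAME`) are assumptions for illustration, and the parser is deliberately simplified (it handles one `KEY == value` assignment per line and ignores continuation lines).

```python
import os
import re

# Keys whose values name input files the run must be able to open.
# This short list is an illustrative assumption, not a complete inventory.
INPUT_FILE_KEYS = ("GRDNAME", "ININAME")

# Matches "KEY = value" or "KEY == value" at the start of a line.
_ASSIGN = re.compile(r"^\s*(\w+)\s*==?\s*(\S+)")


def find_missing_inputs(text, keys=INPUT_FILE_KEYS, exists=os.path.exists):
    """Return (key, path) pairs for listed input files that do not exist.

    `text` is the content of a ROMS-style input file. `exists` can be
    replaced with another predicate, for example when testing.
    """
    missing = []
    for line in text.splitlines():
        line = line.split("!", 1)[0]  # drop '!' comments
        match = _ASSIGN.match(line)
        if match and match.group(1) in keys:
            key, path = match.groups()
            if not exists(path):
                missing.append((key, path))
    return missing
```

Calling `find_missing_inputs(open("ocean.in").read())` (the file name is a placeholder) from the run directory returns an empty list when every listed file is present. Relative paths are resolved against the current working directory, so the check should be run from the directory the model is launched in.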

kate
Posts: 4091
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: cannot restart from rst file

#4 Post by kate

If your restart file has more than one record, it is possible to start from one that isn't the blown-up state. Rather than setting NRREC to -1 (which tells ROMS to read the last time record), set it to 1 or 2.
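
As an illustration of the suggestion above, the relevant lines of the ROMS input file might look like the fragment below. This is a sketch under assumptions: the file name `ocean_rst.nc` is a placeholder, and which record (1 or 2) holds the earlier state depends on the run, because a restart file written with record cycling can have either record as the most recent one. Listing the time values first, for example with `ncdump -v ocean_time ocean_rst.nc`, shows which record to pick.

```
! Hypothetical restart settings; the file name is a placeholder.
       NRREC == 1               ! read record 1 rather than the last record (-1)
     ININAME == ocean_rst.nc    ! restart file supplied as the initial-conditions file
```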
