tides_date.f90 variable definition
Hello,
I'm getting an error in the startup process on a model that includes tides. The stack trace looks like this:
forrtl: severe (194): Run-Time Check Failure. The variable 'tides_date_$FOUNDIT' is being used in 'tides_date.f90(120,13)' without being defined
Image PC Routine Line Source
romsG 00000000024408B4 tides_date_ 120 tides_date.f90
romsG 00000000022CB301 read_phypar_ 2402 read_phypar.f90
romsG 0000000000E1EDEB inp_par_mod_mp_in 113 inp_par.f90
romsG 000000000040EBA1 roms_kernel_mod_m 99 roms_kernel.f90
romsG 00000000004115F0 MAIN__ 97 master.f90
romsG 000000000040E8E2 Unknown Unknown Unknown
libc-2.28.so 00007F65F764CCF3 __libc_start_main Unknown Unknown
romsG 000000000040E7EE Unknown Unknown Unknown
This is on a model instance that I'm trying to bring up to speed on the latest version of ROMS; I had previously been running version 3.8. I think I have correctly implemented what is suggested in this ticket: https://www.myroms.org/projects/src/ticket/896
I am attaching my .h, .in, error, and log files. I can send my tidal forcing file if necessary.
Any suggestions on ways forward?
- Attachments
-
- roms_almirante.in
- (144.63 KiB) Downloaded 493 times
-
- almirante.h
- (3.09 KiB) Downloaded 507 times
-
- almirantelog.txt
- (10.88 KiB) Downloaded 510 times
-
- almirante_error.txt
- (194.29 KiB) Downloaded 519 times
Re: tides_date.f90 variable definition
Could you also attach your tides_date.f90 from your Build_romsG directory?
Re: tides_date.f90 variable definition
Yes, sorry, here it is.
Thanks!
- Attachments
-
- tides_date.f90
- (8.91 KiB) Downloaded 516 times
Re: tides_date.f90 variable definition
I believe the issue is that your roms_almirante.in file is out of date. Several new parameters were added since 3.8. In particular, you are missing INP_LIB and OUT_LIB. Compare your .in file with one of the roms_*.in files in the ROMS/External folder of your new source code and add any missing parameters.
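The comparison suggested above can be scripted rather than done by eye. The following is only a rough sketch, assuming that each parameter in a ROMS .in file is assigned as `KEYWORD = value` or `KEYWORD == value` and that `!` starts a comment; the helper names `roms_keywords` and `missing_keywords` are made up for this example, and the values in the sample text are purely illustrative.

```python
import re

# Assumed assignment form: a keyword, optional spaces, then "=" or "==".
# Parentheses are allowed so that names like Hout(idUvel) are captured.
_ASSIGN = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_()%]*)\s*={1,2}")


def roms_keywords(text):
    """Return the set of keyword names assigned in the text of a .in file."""
    keys = set()
    for line in text.splitlines():
        line = line.split("!", 1)[0]  # drop everything after a comment marker
        match = _ASSIGN.match(line)
        if match:
            keys.add(match.group(1))
    return keys


def missing_keywords(app_text, ref_text):
    """Keywords present in the reference file but absent from the application file."""
    return sorted(roms_keywords(ref_text) - roms_keywords(app_text))


# Illustrative snippets only; real files would be read from disk.
reference = """
! I/O library choice
INP_LIB = 1
OUT_LIB = 1
NtileI == 4
"""
application = """
NtileI == 12   ! tiles in the I-direction
"""

print(missing_keywords(application, reference))
```

In practice the two strings would come from the application .in file and one of the roms_*.in files in ROMS/External. A keyword-level comparison like this only flags parameters that are absent; it says nothing about values that need to change.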
Re: tides_date.f90 variable definition
Ah, of course. This was the issue! Thanks, David.
Re: tides_date.f90 variable definition
A new problem has arisen, though. This seems to be some sort of I/O error with openmpi, although it's occurring in the set_tides.f90 code. Here's the stack trace:
[hpc3-21-18][[59635,1],230][btl_openib_component.chandle_wc] Unhandled work completion opcode is 136
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
romsG 000000000246E7AB Unknown Unknown Unknown
libpthread-2.28.s 00007FEE7150ACE0 Unknown Unknown Unknown
libmpi.so.40.20.3 00007FEE717460D0 Unknown Unknown Unknown
libmpi.so.40.20.3 00007FEE7176DFAF ompi_request_defa Unknown Unknown
libmpi.so.40.20.3 00007FEE717C5D2F ompi_coll_base_se Unknown Unknown
libmpi.so.40.20.3 00007FEE717C85FD ompi_coll_base_al Unknown Unknown
libmpi.so.40.20.3 00007FEE717822EE PMPI_Allreduce Unknown Unknown
libmpi_mpifh.so.4 00007FEE71AA8B59 mpi_allreduce_ Unknown Unknown
romsG 0000000001204B19 distribute_mod_mp 1604 distribute.f90
romsG 000000000066D7D7 set_tides_mod_mp_ 282 set_tides.f90
romsG 000000000064FA36 set_tides_mod_mp_ 65 set_tides.f90
romsG 000000000041761C main3d_ 187 main3d.f90
romsG 000000000040FEC5 roms_kernel_mod_m 191 roms_kernel.f90
romsG 0000000000411A4D MAIN__ 110 master.f90
romsG 000000000040E8E2 Unknown Unknown Unknown
libc-2.28.so 00007FEE70F69CF3 __libc_start_main Unknown Unknown
romsG 000000000040E7EE Unknown Unknown Unknown
Any thoughts? If this would be better in a new thread/different subforum let me know. Attaching the set_tides.f90, distribute.f90, log, error, and new .in files.
- Attachments
-
- distribute.f90
- (222.42 KiB) Downloaded 487 times
-
- set_tides.f90
- (21.32 KiB) Downloaded 498 times
-
- roms_almirante.in
- (156.74 KiB) Downloaded 490 times
-
- almirante_mpi_io_problem.txt
- (1.53 KiB) Downloaded 496 times
-
- almirantelog.txt
- (150.16 KiB) Downloaded 506 times
Re: tides_date.f90 variable definition
Does the log really end abruptly like that, with no error reporting from ROMS?
John Wilkin: DMCS Rutgers University
71 Dudley Rd, New Brunswick, NJ 08901-8521, USA. ph: 609-630-0559 jwilkin@rutgers.edu
Re: tides_date.f90 variable definition
Yes, strangely. I'm attaching the output from the compilation process here, in case that's of use.
- Attachments
-
- almirante_buildlog.txt
- (521.67 KiB) Downloaded 542 times
- arango
- Site Admin
- Posts: 1367
- Joined: Wed Feb 26, 2003 4:41 pm
- Location: DMCS, Rutgers University
- Contact:
Re: tides_date.f90 variable definition
I think the issue here is that something is missing in the configuration. Did you update varinfo.yaml? Most users ignore the trac updates to ROMS. We provide precise information in trac tickets, with instructions on using new features. We often see postings here where we have to reconstruct and guess at the user's source code, input scripts, and NetCDF files.
Re: tides_date.f90 variable definition
Thanks for the suggestion--my .in file does include a path to the .yaml file, rather than a non-existent .dat file:
! Input variable information file name. This file needs to be processed
! first so all information arrays can be initialized properly.
VARNAME = /dfs6/pub/skastner/ROMS/trunk/ROMS/External/varinfo.yaml
This is indeed where the varinfo.yaml file is located.
Is this what you meant? I've tried to go back through the trac tickets as suggested, but I am not sure which tickets correspond to things I should change in my .in file. I have gone through the output of a "diff" command between my application .in file (roms_almirante.in, attached here) and the upwelling test case .in file (roms_upwelling.in, also attached here, which does not produce this error), and changed the glaring differences between the two.
The other trac ticket that looked like it could influence my model was this, on LuvSrc/LwSrc: https://www.myroms.org/projects/src/ticket/905
I don't use LwSrc, though, only LuvSrc.
It is possible that some of my metadata conflicts between the .yaml file and my netcdf forcing files---I will check this. I do find it hard to believe that a metadata conflict could cause this kind of error, though.
- Attachments
-
- roms_upwelling.in
- (154.63 KiB) Downloaded 495 times
-
- roms_almirante.in
- (156.74 KiB) Downloaded 483 times
Re: tides_date.f90 variable definition
One perplexing thing is the way in which your job exited. There are quite extensive error trapping routines in ROMS so that you get some information on where the code failed before it exits. Yours just stopped, judging by the log you posted. It did not indicate a BLOW UP, or give a line number for the point of failure. This somewhat points to a system problem rather than a code problem.
That said, I notice some things in your log. Your shallowest depth is 0.1 m, and the thinnest vertical layer 5 mm (!) thick. You might consider setting the minimum depth in your bathymetry to be a bit deeper. But if this were the problem, and a vertical CFL violation occurred, ROMS would have reported this.
You have a very large grid and a large number of processors (240). So you are computing across multiple nodes in a cluster. Can you test this model on a smaller number of cores?
Have you followed our advice that the libraries you link to are compiled with the same compiler (and version) you are using for ROMS itself?
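On the minimum-depth point above, enforcing a floor on the bathymetry amounts to a simple clamp. This is a minimal sketch on a plain nested list; the function name `clip_min_depth` and the 1.0 m floor are hypothetical choices for illustration, not recommended values, and a real grid would be edited in its NetCDF file.

```python
def clip_min_depth(h, hmin):
    """Return a copy of the 2-D depth array h (metres, positive down)
    with every value raised to at least hmin."""
    return [[max(depth, hmin) for depth in row] for row in h]


# Toy bathymetry with a 0.1 m shallowest point, as reported in the log.
h = [[0.1, 0.4, 2.0],
     [0.3, 1.5, 6.0]]

clipped = clip_min_depth(h, hmin=1.0)  # hmin = 1.0 m is an arbitrary example
print(clipped)
```

Only the cells shallower than the floor change; deeper cells are left as they were.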
John Wilkin: DMCS Rutgers University
Re: tides_date.f90 variable definition
Thanks, John! I'll look into changing the bathymetry. I am using the same compilers to compile and for the run itself.
I'm going to run a few tests:
1.) Use a smaller number of cores
2.) Run in serial
3.) Use newer versions of openmpi and netcdf. Currently, I'm using openmpi 4.0.3 and netcdf 4.7.0. I have access to openmpi 4.1.2 and netcdf 4.8.1, which both use the 2022 version of ifort (I'm currently using 2020 for these modules).
4.) Change the bathymetry.
Do you think the version of openmpi matters between these two?
All that said, I'm re-running the model now as it was to replicate the error and it's not crashing, which is in some ways scarier than the error I was getting previously.
Re: tides_date.f90 variable definition
The fact that your earlier run terminated with no error report from ROMS does suggest it was a crash of one of the processors. You should ask your sysadmin if they have error logs to help diagnose a processor failure at the time your run crashed. All the more reason to test on fewer cores while you debug the set-up.
John Wilkin: DMCS Rutgers University
Re: tides_date.f90 variable definition
Sorry for the slow response--I've tested the model with three processor counts: 1 node (40 cpus), 3 nodes (120 cpus), and 6 nodes (240 cpus). The 6-node run crashes as described above. The 1-node run is excruciatingly slow (~2x real-time speed) and does not crash in the time I allowed it to run. The 3-node run is slightly faster (~5x real-time speed, still not fast enough) and also does not crash in the time I allowed it to run. It does throw this warning, though:
forrtl: warning (406): fort: (1): In call to MP_GATHER3D, an array temporary was created for argument #13
Image PC Routine Line Source
romsG 00000000024655EF Unknown Unknown Unknown
romsG 0000000000EA662C nf_fwrite4d_mod_m 183 nf_fwrite4d.f90
romsG 0000000000E0586A wrt_rst_mod_mp_wr 285 wrt_rst.f90
romsG 0000000000DF2F79 wrt_rst_mod_mp_wr 69 wrt_rst.f90
romsG 00000000004FB9C0 output_ 284 output.f90
romsG 0000000000419517 main3d_ 230 main3d.f90
romsG 000000000040FEC5 roms_kernel_mod_m 191 roms_kernel.f90
romsG 0000000000411A4D MAIN__ 110 master.f90
romsG 000000000040E8E2 Unknown Unknown Unknown
libc-2.28.so 00007F404E8C7CF3 __libc_start_main Unknown Unknown
romsG 000000000040E7EE Unknown Unknown Unknown
I'm attaching the full error file, the run timed out after 5 days.
This seems to be happening while writing the restart file? I don't think this is the same as what occurred above, so perhaps not related. It does have to do with the mpi setup, though.
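For reference, the two run speeds quoted above give a rough scaling estimate. This is a back-of-the-envelope sketch that takes the approximate figures at face value (~2x real time on 40 cpus, ~5x on 120 cpus); the function name `parallel_efficiency` is just a label for this example.

```python
def parallel_efficiency(cores_a, rate_a, cores_b, rate_b):
    """Speedup from run A to run B divided by the increase in core count."""
    speedup = rate_b / rate_a
    return speedup / (cores_b / cores_a)


# Approximate speeds from the tests above, in model time per wall-clock time.
eff = parallel_efficiency(40, 2.0, 120, 5.0)
print(f"speedup {5.0 / 2.0:.1f}x on 3x the cores, efficiency {eff:.2f}")
```

Since the inputs are rounded, the result is only indicative.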
- Attachments
-
- slurmerror_romsv41_120cpu.txt
- (3.9 MiB) Downloaded 526 times
- arango
Re: tides_date.f90 variable definition
ROMS will run much faster if the code is optimized (executable romsM) than with debugging flags (executable romsG).