errors when running Lagrangian floats in ROMS 3.9

#1 by goodboy

Hi, everyone.
I'm running Lagrangian floats in my application under ROMS 3.9. The code compiles successfully, but the run stops abnormally and I can't work out why. Even with a debug build, nothing tells me the reason.

Code: Select all

mpirun -np 4 ./LAGRANGIAN_deb ../input/roms_lagragian.in > log
The log file doesn't give any reason for the error; I have attached it at the end.
However, some error messages are printed to my screen.

Code: Select all

forrtl: severe (194): Run-Time Check Failure. The variable 'mod_floats_mp_allocate_floats_$NG' is being used in 'mod_floats.f90(143,18)' without being defined
Image              PC                Routine            Line        Source
LAGRANGIAN_deb     00000000029BEE59  mod_floats_mp_all         143  mod_floats.f90
LAGRANGIAN_deb     0000000002380A46  read_fltpar_               55  read_fltpar.f90
LAGRANGIAN_deb     0000000001CF8282  inp_par_                  569  inp_par.f90
LAGRANGIAN_deb     000000000040C9B6  ocean_control_mod          86  ocean_control.f90
LAGRANGIAN_deb     000000000040C207  MAIN__                     95  master.f90
LAGRANGIAN_deb     000000000040BF1E  Unknown               Unknown  Unknown
libc-2.17.so       00002B8CE6DD5C05  __libc_start_main     Unknown  Unknown
LAGRANGIAN_deb     000000000040BE29  Unknown               Unknown  Unknown
[The same traceback is printed by each of the other three MPI processes.]
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[20760,1],0]
  Exit code:    194
--------------------------------------------------------------------------
[admin:189133] 4 more processes have sent help message help-oob-ud.txt / create-qp-failed
[admin:189133] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[admin:189133] 4 more processes have sent help message help-oob-ud.txt / no-ports-usable
[admin:189133] 7 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[admin:189133] 7 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
I have checked mod_floats.f90:

Code: Select all

137 !-----------------------------------------------------------------------
138 !  Lagrangian drifters parameters.
139 !-----------------------------------------------------------------------
140 !
141       IF (.not.Ldrifter) THEN
142         allocate ( Fprint(Ngrids) )
143         Dmem(ng)=Dmem(ng)+REAL(Ngrids,r8)
144         allocate ( frrec(Ngrids) )
145         Dmem(ng)=Dmem(ng)+REAL(Ngrids,r8)
146         allocate ( ifTvar(MT) )
147         Dmem(ng)=Dmem(ng)+REAL(MT,r8)
148       END IF
149 !
I guess the reason is that mod_floats.f90 doesn't use mod_parallel, but I'm not sure. Could anyone help me?
Attachment: log.log (13.47 KiB)

#2 by wilkin

If you try this with the myroms.org version of the code more of us might be able to help you.
John Wilkin: DMCS Rutgers University
71 Dudley Rd, New Brunswick, NJ 08901-8521, USA. ph: 609-630-0559 jwilkin@rutgers.edu

#3 by goodboy

I have tried this application in ROMS 3.8, downloaded from www.myroms.org. It also blows up after the first timestep. I turned on debug mode to see what happened, and it reports a problem with my NetCDF files, both input and output. Could you please help me track down this error?
The attached log.log contains my command line and the errors printed during the run.
Attachment: log.log (3.53 MiB)

#4 by wilkin

Your log file indicates that the problems with reading initial conditions are all at lines 208 and 285 of initial.f90. So, see what ROMS is trying to read at those lines.

But your errors seem more like a fundamental problem with netcdf. Is your netcdf library built with the same compiler you are using for ROMS? Check with your system admin.

#5 by arango (Site Admin, DMCS, Rutgers University)

You need to use one of the older versions or the latest version of our distributed code. I checked the trac history for the mod_floats.F module and I cannot find the offending code anywhere in the last five years!

Anyway, you need to read and understand the error message:
forrtl: severe (194): Run-Time Check Failure. The variable 'mod_floats_mp_allocate_floats_$NG' is being used in 'mod_floats.f90(143,18)' without being defined
It is as clear as water. It says that in routine allocate_floats of the module mod_floats, the variable ng is being used without being defined, that is, before it has been assigned a value. That's what the string mod_floats_mp_allocate_floats_$NG in the Fortran error means: module, routine, and variable. It also gives you the line and column in the C-preprocessed file mod_floats.f90. Every Fortran user needs to become familiar with the syntax of the reported errors.
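As a rough illustration of how that message decomposes, here is a minimal Python sketch that splits it into module, routine, variable, file, line, and column. The regular expression is an assumption based only on the single message quoted in this thread (names joined by `_mp_`, the variable after `$`); it is not a general parser for Intel Fortran diagnostics, and `parse_forrtl_undefined` is a hypothetical helper name.

```python
import re

# Pattern for the one forrtl run-time check message quoted in this thread.
# Assumption: the mangled name has the form <module>_mp_<routine>_$<VARIABLE>.
PATTERN = re.compile(
    r"The variable '(?P<module>\w+?)_mp_(?P<routine>\w+?)_\$(?P<variable>\w+)' "
    r"is being used in '(?P<file>[^'(]+)\((?P<line>\d+),(?P<column>\d+)\)' "
    r"without being defined"
)


def parse_forrtl_undefined(message):
    """Return the parts of a 'used without being defined' message, or None."""
    match = PATTERN.search(message)
    if match is None:
        return None
    info = match.groupdict()
    info["variable"] = info["variable"].lower()  # Fortran names are case-insensitive
    info["line"] = int(info["line"])
    info["column"] = int(info["column"])
    return info


message = (
    "forrtl: severe (194): Run-Time Check Failure. The variable "
    "'mod_floats_mp_allocate_floats_$NG' is being used in "
    "'mod_floats.f90(143,18)' without being defined"
)
info = parse_forrtl_undefined(message)
print(info)
```

For the message above this reports module mod_floats, routine allocate_floats, variable ng, and position 143,18 in mod_floats.f90, which is exactly the statement to inspect.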

In our version, we don't have such code shown in lines 143, 145, 147:

Code: Select all

137 !-----------------------------------------------------------------------
138 !  Lagrangian drifters parameters.
139 !-----------------------------------------------------------------------
140 !
141       IF (.not.Ldrifter) THEN
142         allocate ( Fprint(Ngrids) )
143         Dmem(ng)=Dmem(ng)+REAL(Ngrids,r8)
144         allocate ( frrec(Ngrids) )
145         Dmem(ng)=Dmem(ng)+REAL(Ngrids,r8)
146         allocate ( ifTvar(MT) )
147         Dmem(ng)=Dmem(ng)+REAL(MT,r8)
148       END IF

#6 by goodboy

wilkin wrote: "Your log file indicates the problems with reading initial conditions are all at line 208 and 285 of initial.f90. So, see what ROMS is trying to read at that line."
In my ROMS 3.9 source code, the relevant parts of initial.f90 are:

Code: Select all

195 !=======================================================================
196 !  Initialize model state variables and forcing.  This part is
197 !  executed for each ensemble/perturbation/iteration run.
198 !=======================================================================
199 !
200 !-----------------------------------------------------------------------
201 !  Set primitive variables initial conditions.
202 !-----------------------------------------------------------------------
203 !
204 !  Read in initial conditions from initial NetCDF file.
205 !
206       DO ng=1,Ngrids
207 !$OMP MASTER
208         CALL get_state (ng, iNLM, 1, INI(ng)%name,                      &
209      &                  IniRec(ng), Tindex(ng))
210 !$OMP END MASTER
211         CALL mp_bcasti (ng, iNLM, exit_flag)
212 !$OMP BARRIER
213         IF (FoundError(exit_flag, NoError, 561,                    &
214      &                 "ROMS/Nonlinear/initial.F")) RETURN
215         time(ng)=io_time                     ! needed for shared-memory
216       END DO

Code: Select all

263 !  If applicable, close all input boundary, climatology, and forcing
264 !  NetCDF files and set associated parameters to the closed state. This
265 !  step is essential in iterative algorithms that run the full TLM
266 !  repetitively. Then, Initialize several parameters in their file
267 !  structure, so the appropriate input single or multi-file is selected
268 !  during initialization/restart.
269 !
270       DO ng=1,Ngrids
271 !$OMP MASTER
272         CALL close_inp (ng, iNLM)
273         CALL check_multifile (ng, iNLM)
274 !$OMP END MASTER
275         CALL mp_bcasti (ng, iNLM, exit_flag)
276 !$OMP BARRIER
277         IF (FoundError(exit_flag, NoError, 818,                    &
278      &                 "ROMS/Nonlinear/initial.F")) RETURN
279       END DO
280 !
281 !  If applicable, read in input data.
282 !
283       DO ng=1,Ngrids
284 !$OMP MASTER
285         CALL get_idata (ng)
286         CALL get_data (ng)
287 !$OMP END MASTER
288         CALL mp_bcasti (ng, iNLM, exit_flag)
289 !$OMP BARRIER
290         IF (FoundError(exit_flag, NoError, 833,                    &
291      &                 "ROMS/Nonlinear/initial.F")) RETURN
292       END DO
I've checked the log file. It tells me that the cause is abnormal temperature and speed after the first timestep.

Code: Select all

 DIAG speed trouble    130.716602029650     
 DIAG temperature trouble   -63264.4131214336                1
 DIAG speed ijk    125.230723764701              129          73          25
           1
 Found Error: 01   Line: 332      Source: ROMS/Nonlinear/main3d.F
 Found Error: 01   Line: 303      Source: ROMS/Drivers/nl_ocean.h

 Blowing-up: Saving latest model state into  RESTART file
     REASON: MaxDensity =  8.2E+01
However, there are no NaNs or abnormal values in my initial file, so I am puzzled.
I have attached the header of the initial file and the log file at the end.
wilkin wrote: "But your errors seem more like a fundamental problem with netcdf. Is your netcdf library built with the same compiler you are using for ROMS? Check with your system admin."
I installed the netcdf library myself, and it ran successfully last year. The library is compiled with mpif90, and my ROMS application uses openmpi and ifort, which is the same compiler that mpif90 wraps.
arango wrote: "In our version, we don't have such code shown in lines 143, 145, 147:"
I have checked my ROMS 3.8 and 3.9 source code, and those three lines are indeed absent from ROMS 3.8. I have therefore commented out the three lines, and the model now runs without that error.
However, there is another error: the run blows up after the first timestep, as I mentioned before.
Attachments: ROMS3_9.log (6.99 MiB), inifile.txt (29.59 KiB)

#7 by wilkin

You have
You are running Kate's code, so I can't be entirely sure my experience will help. And if you are mixing and matching float routines from myroms.org with Kate's then I'm not sure if that's a problem.

But looking at the logfile I have a few remarks.

First, don't all the system errors at the start worry you?
libi40iw-i40iw_ucreate_qp: failed to create QP, unsupported QP type: 0x4
--------------------------------------------------------------------------
Failed to create a queue pair (QP):

Hostname: admin
Requested max number of outstanding WRs in the SQ: 1
Requested max number of outstanding WRs in the RQ: 2
Requested max number of SGEs in a WR in the SQ: 511
Requested max number of SGEs in a WR in
And in the middle of the start-up:
[admin:37984] 9 more processes have sent help message help-oob-ud.txt / create-qp-failed
[admin:37984] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[admin:37984] 9 more processes have sent help message help-oob-ud.txt / no-ports-usable
If it were me I'd want to get to the bottom of those before proceeding.

In the logfile report itself you have some strange data.
GET_NGFLD - tidal period
(Grid = 01, Min = 0.00000000E+00 Max = 0.00000000E+00)
All your tidal periods are zero, which means the frequencies are infinite. All your cosine(omega*t) terms might be crazy.
GET_2DFLD - solar shortwave radiation flux, 2016-04-09 18:00:00.00
(Grid=01, Rec=0000384, Index=1, File: swrad_2016.nc)
(Tmin= 16801.0000 Tmax= 17166.7500) t = 16900.7500
(Min = -1.48765379E-87 Max = 2.46533933E-87) regrid = T
E-87 (!) for radiation. Why is that?
GET_2DFLD - sea surface temperature climatology, 2016-03-15 00:00:00.00
(Grid=01, Rec=0000003, Index=1, File: SST_2016.nc)
(Tmin= 16815.0000 Tmax= 17150.0000) t = 16875.0000
(Min = -6.21509557E+00 Max = 2.57639013E+01) regrid = F
Your climatology has SST = -6.2.
GET_2DFLD - sea surface salinity climatology, 2016-03-16 12:00:00.00
(Grid=01, Rec=0000003, Index=1, File: SSS_2016.nc)
(Tmin= 16816.5000 Tmax= 17151.5000) t = 16876.5000
(Min = -1.10097628E+01 Max = 4.46992666E+01) regrid = F
Your salinity is -11.

I notice you have defined TS_DIF2 and UV_VIS2 but ...
0.0000E+00 nl_tnu2(01) NLM Horizontal, harmonic mixing coefficient
(m2/s) for tracer 01: temp
0.0000E+00 nl_tnu2(02) NLM Horizontal, harmonic mixing coefficient
(m2/s) for tracer 02: salt
0.0000E+00 nl_visc2 NLM Horizontal, harmonic mixing coefficient
(m2/s) for momentum.
... all your mixing and friction coefficients are zero, so those terms do nothing.

I don't think floats are your problem.

Start simple to get a run going and turn on things like floats later.

And perhaps turn off SCORRECTION and QCORRECTION to see if those crazy surface salinity and temperature values are the culprit.

#8 by goodboy

wilkin wrote: "And if you are mixing and matching float routines from myroms.org with Kate's then I'm not sure if that's a problem."
I am not mixing code from ROMS 3.8 and 3.9. I'm running the application under ROMS 3.9 only.
wilkin wrote: "First, don't all the system errors at the start worry you?"
wilkin wrote: "And in the middle of the start-up:"
They don't matter. They are just warnings related to the compiler setup on my computer and can be ignored. I have run successfully with them before.
wilkin wrote: "In the logfile report itself you have some strange data."
I have checked my input data, and there were indeed wrong values in my tide file and in the SST and SSS climatology files. I have corrected them.
wilkin wrote: "E-87 (!) for radiation. Why is that?"
This is because the solar shortwave radiation flux is 0 at night. My application is in the East China Sea, so the solar radiation is 0 between 18:00 and 06:00 the next day (UTC). The data is correct.
wilkin wrote: "... all your mixing and friction coefficients are zero, so those terms do nothing."
I have set them to suitable values. Even with all these changes, the run still fails with the same errors: it blows up after the first timestep.
wilkin wrote: "And perhaps turn off SCORRECTION and QCORRECTION to see if those crazy surface salinity and temperature values are the culprit."
I have turned off these two options, but the run still fails: it blows up after the first timestep because of abnormal temperature and speed values. I have attached the new logfile from the run with these two options turned off.

Now I know the problem isn't related to the floats. It has troubled me for a few weeks. Maybe I should ask this question again under a different title.
Attachment: ROMS3_9.log (6.99 MiB)

#9 by wilkin

In your logfile:
GET_2DFLD - surface air pressure, 2016-04-10 06:00:00.00
(Grid=01, Rec=0000386, Index=1, File: Pair_2016.nc)
(Tmin= 16801.0000 Tmax= 17166.7500) t = 16901.2500
(Min = 7.66470933E+00 Max = 1.02419869E+01) regrid = T
Air pressure should be given in millibar (or converted with a scale factor in varinfo.dat). These are not millibar.

GET_2DFLD - surface air relative humidity, 2016-04-10 06:00:00.00
(Grid=01, Rec=0000386, Index=1, File: Qair_2016.nc)
(Tmin= 16801.0000 Tmax= 17166.7500) t = 16901.2500
(Min = 6.23252507E+00 Max = 1.00000000E+02) regrid = T
Humidity reported here should have values in the range 0 to 1. varinfo.dat indicates humidity is to be given in percentage but converted by a scale factor of 0.01. Maybe Kate's code or varinfo.dat is modified to accept different units, but at this reporting step in the log I think the numbers should still have ended up in the range 0 - 1.

Please move further discussion on this to a new thread once you have verified that you are providing all inputs in the correct units.
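One way to catch unit problems like these before starting a run is to compare the Min/Max values that ROMS prints for each input field against a plausible physical range. The Python sketch below does that for the values quoted in this thread. The bounds in `EXPECTED` are illustrative assumptions, not values taken from ROMS or varinfo.dat, and `out_of_range` is a hypothetical helper name.

```python
# Plausible ranges after any scale factor has been applied.
# These bounds are illustrative assumptions, not ROMS defaults.
EXPECTED = {
    "Pair": (850.0, 1100.0),  # surface air pressure, millibar
    "Qair": (0.0, 1.0),       # relative humidity as a fraction
    "SST": (-2.5, 40.0),      # sea surface temperature, Celsius
    "SSS": (0.0, 42.0),       # sea surface salinity
}


def out_of_range(name, vmin, vmax, expected=EXPECTED):
    """Return True if the reported Min/Max falls outside the expected range."""
    low, high = expected[name]
    return vmin < low or vmax > high


# Min/Max pairs copied from the log excerpts quoted in this thread.
reported = {
    "Pair": (7.66470933, 10.2419869),
    "Qair": (6.23252507, 100.0),
    "SST": (-6.21509557, 25.7639013),
    "SSS": (-11.0097628, 44.6992666),
}

suspect = [
    name
    for name, (vmin, vmax) in reported.items()
    if out_of_range(name, vmin, vmax)
]
print(suspect)  # ['Pair', 'Qair', 'SST', 'SSS']
```

All four fields are flagged with these assumed bounds, which matches the remarks made on them in this thread. Such a check only points at candidates; the scale factors in varinfo.dat still have to be verified by hand.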

#10 by goodboy

I have corrected both Pair and Qair. The problem was that the scale factors differ between ROMS 3.9 and ROMS 3.8. Now the model runs for about one day. However, it blows up after 2280 timesteps with NaN values.

Code: Select all

      2280 2016-04-11 14:00:00.00           NaN           NaN           NaN           NaN
                     (000,000,00)  0.000000E+00  0.000000E+00  0.000000E+00           NaN
 Found Error: 01   Line: 332      Source: ROMS/Nonlinear/main3d.F
 Found Error: 01   Line: 303      Source: ROMS/Drivers/nl_ocean.h

 Blowing-up: Saving latest model state into  RESTART file
     REASON: KEchar =      NaN, PEchar =      NaN

      WRT_RST     - wrote re-start    fields (Index=1,1) in record = 0000002
I have attached the log file at the end. Could you please help me? What causes the NaN values of PE and KE? I have checked the restart file and can't find any abnormal values.
Attachment: LAGRANGIAN.log (136.48 KiB)

#11 by goodboy

I have solved this problem. The NaN values of PE (potential energy) and KE (kinetic energy) were really tricky. The cause was a violation of the CFL criterion. In my application, DT was 60 s, while the spatial resolution is about 2.5 km in the coastal region and 8 km in the open ocean. The timestep was too long for this resolution. After changing it to 30 s, my application runs successfully.
Many thanks to you and arango for your help!
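For readers hitting the same blow-up, the effect of the timestep can be sketched with the simple one-dimensional Courant number C = speed * dt / dx. The Python snippet below uses the grid spacings and timesteps from this post; the signal speed of 3 m/s is purely hypothetical, and the real stability limit in ROMS depends on its split time stepping, the vertical grid, and the numerical schemes in use, so treat this as an order-of-magnitude aid only.

```python
def courant_number(speed, dt, dx):
    """One-dimensional Courant number C = speed * dt / dx."""
    return speed * dt / dx


def max_stable_dt(speed, dx, courant_limit=1.0):
    """Largest timestep that keeps the Courant number at or below the limit."""
    return courant_limit * dx / speed


dx_coast = 2500.0  # m, coastal resolution from the post
dx_open = 8000.0   # m, open-ocean resolution from the post
speed = 3.0        # m/s, hypothetical fastest signal; not from the thread

for dt in (60.0, 30.0):
    c_coast = courant_number(speed, dt, dx_coast)
    c_open = courant_number(speed, dt, dx_open)
    print(f"dt={dt:4.0f} s  C_coast={c_coast:.3f}  C_open={c_open:.3f}")
```

Two things follow directly from the formula: the finest cells (here the 2.5 km coastal region) set the limit for the whole grid, and halving DT from 60 s to 30 s halves the Courant number everywhere.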
