Pair and MPI version has problem with regridding
- jivica
- Posts: 172
- Joined: Mon May 05, 2003 2:41 pm
- Location: The University of Western Australia, Perth, Australia
Pair and MPI version has problem with regridding
I'm having a problem with an application where I use Pair on the atmospheric model's native grid, and *ONLY* with the MPI-parallel version of the latest code.
In the serial version the regridding works OK, but the parallel version still has a problem.
Trying to nail that down, by digging I think I've found a bug in ./Modules/mod_forces.F, line 509,
which should read as below, since we keep 2 time snapshots in PairG:
# ifndef ANA_PAIR
      allocate ( FORCES(ng) % PairG(LBi:UBi,LBj:UBj,2) )
      Dmem(ng)=Dmem(ng)+2.0_r8*size2d
# endif
This still doesn't fix the parallel tile problem with Pair, though.
Cheers
Ivica
- arango
- Site Admin
- Posts: 1367
- Joined: Wed Feb 26, 2003 4:41 pm
- Location: DMCS, Rutgers University
Re: Pair and MPI version has problem with regridding
Yes, the Dmem is a diagnostic quantity to estimate the memory requirement for an application. It has nothing to do with the numerical kernel.
It is on my TODO list to look at your problem in the debugger. The issues that you describe sound like a parallel bug. However, the regrid subroutine is generic for all variables; it cannot interpolate correctly for all the other forcing variables and fail only for Pair. That doesn't make sense.
The problem must be somewhere else. The fact that it happens when you use the new option PRESS_COMPENSATE tells me that a parallel exchange is missing for Pair. Notice that the pressure is averaged at U- and V-points in u2dbc_im.F and v2dbc_im.F. I need to look at what is going on in set_2dfield.F when Pair is time-interpolated from snapshots. The MPI exchange is always done at the bottom of the subroutine.
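For reference, a minimal sketch of the kind of exchange that normally sits at the bottom of such a routine for a 2D RHO-point field like Pair; the exact argument lists of exchange_r2d_tile and mp_exchange2d are assumed here, so treat this as illustration rather than the actual code:
!  Sketch only: periodic and distributed-memory halo update for a
!  2D RHO-point field (argument lists assumed).
      IF (EWperiodic(ng).or.NSperiodic(ng)) THEN
        CALL exchange_r2d_tile (ng, tile, LBi, UBi, LBj, UBj,          &
                                FORCES(ng)%Pair)
      END IF
# ifdef DISTRIBUTE
      CALL mp_exchange2d (ng, tile, iNLM, 1, LBi, UBi, LBj, UBj,       &
                          NghostPoints,                                &
                          EWperiodic(ng), NSperiodic(ng),              &
                          FORCES(ng)%Pair)
# endif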
- jivica
- Posts: 172
- Joined: Mon May 05, 2003 2:41 pm
- Location: The University of Western Australia, Perth, Australia
Re: Pair and MPI version has problem with regridding
Hernan,
I know you are quite busy and that this bug is on your TODO list;
this post was more to make others aware of the problem in the latest version of the ROMS code (if they use MPI and Pair as I do).
Using, in addition to ATM_PRESS, the boundary pressure-correction option PRESS_COMPENSATE doesn't change anything.
I am confused by the regridding as well: sustr/svstr are regridded OK, only Pair has the problem (!),
which smells like a wrong memory allocation? I don't have TotalView, so I'm stuck here.
For example, serial ROMS Pair field at first time step:
MPI version of ROMS and the same Pair at first time step:
MPI version of ROMS and sustr which is OK:
Thanks for your time !
Ivica
Re: Pair and MPI version has problem with regridding
What happens if you run MPI with one processor?
I can look at PRESS_COMPENSATE; we want to use that for our hurricane simulations.
-j
- arango
- Site Admin
- Posts: 1367
- Joined: Wed Feb 26, 2003 4:41 pm
- Location: DMCS, Rutgers University
Re: Pair and MPI version has problem with regridding
I took a look in the debugger with our US East Coast application and I cannot find anything wrong. I activated both ATM_PRESS and PRESS_COMPENSATE. I am also using BULK_FLUXES, which also needs Pair. I don't see a parallel bug.
The Pair field is kind of jagged, but that is because of the coarse resolution of the NCEP dataset. I cannot reproduce your parallel problem and am clueless about what is going on in your application. Are you also activating BULK_FLUXES? What is the range of the longitude in your Pair data?
- jivica
- Posts: 172
- Joined: Mon May 05, 2003 2:41 pm
- Location: The University of Western Australia, Perth, Australia
Re: Pair and MPI version has problem with regridding
John,
I tried mpirun -np 1 and it works OK; it gives an identical result to the serial run.
It is a Southern Hemisphere system, NO BULK_FLUXES, only a storm surge case with surface wind stress and pressure.
Reading of the original data is OK as well (in all cases, MPI or serial), with reasonable values within range.
I will try other tile configurations, i.e. different NX * NY.
- jivica
- Posts: 172
- Joined: Mon May 05, 2003 2:41 pm
- Location: The University of Western Australia, Perth, Australia
Re: Pair and MPI version has problem with regridding
It is getting even more interesting:
it works for certain tile configurations (2x2, 3x2), crashes for 6x4, and then works for 32, 36, and 48 tiles but with the wrong pressure, as I wrote.
I recompiled ROMS with the debug option, gfortran + OpenMPI, and the bomb turns out to be in inp_par.f90, where line 77 is the IF statement; it complains about a load of a logical of kind=4:
!-----------------------------------------------------------------------
!  Set lower and upper bounds indices per domain partition for all
!  nested grids.
!-----------------------------------------------------------------------
!
!  Determine the number of ghost-points in the halo region.
!
      NghostPoints=2
      IF (ANY(CompositeGrid).or.ANY(RefinedGrid)) THEN
        NghostPoints=MAX(3,NghostPoints)
      END IF
!
The error:
inp_par.f90:77: runtime error: load of null pointer of type 'logical(kind=4)'
ASAN:DEADLYSIGNAL
=================================================================
==2257==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x564f18654538 bp 0x7ffff1babab0 sp 0x7ffff1ba7d50 T0)
==2257==The signal is caused by a READ memory access.
==2257==Hint: address points to the zero page.
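That message reads as if one of the logical arrays inside ANY() is referenced before it has been allocated. A minimal defensive sketch, assuming CompositeGrid and RefinedGrid are allocatable logical arrays (a hypothetical guard, not the actual fix):
!  Hypothetical guard: only evaluate ANY() once both arrays exist.
      NghostPoints=2
      IF (ALLOCATED(CompositeGrid).and.ALLOCATED(RefinedGrid)) THEN
        IF (ANY(CompositeGrid).or.ANY(RefinedGrid)) THEN
          NghostPoints=MAX(3,NghostPoints)
        END IF
      END IF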
- jivica
- Posts: 172
- Joined: Mon May 05, 2003 2:41 pm
- Location: The University of Western Australia, Perth, Australia
Re: Pair and MPI version has problem with regridding
I'm not sure if I am right, but I think the problem is in the new version of regrid and the new variable "MyXout".
After compiling in debug mode (with MPI), I managed to trap it for a 6x6 tile configuration (4x4 and 8x8 work?!):
At line 155 of file regrid.f90
At line 155 of file regrid.f90
At line 155 of file regrid.f90
At line 155 of file regrid.f90
At line 155 of file regrid.f90
At line 155 of file regrid.f90
At line 155 of file regrid.f90
At line 155 of file regrid.f90
Fortran runtime error: Index '1' of dimension 1 of array 'myxout' below lower bound of 335
Error termination. Backtrace:
At line 155 of file regrid.f90
Fortran runtime error: Index '171' of dimension 1 of array 'myxout' above upper bound of 170
Error termination. Backtrace:
At line 155 of file regrid.f90At line 155 of file regrid.f90
Fortran runtime error: Index '1' of dimension 1 of array 'myxout' below lower bound of 167
Error termination. Backtrace:
At line 155 of file regrid.f90At line 155 of file regrid.f90At line 155 of file regrid.f90At line 155 of file regrid.f90
Fortran runtime error: Index '1' of dimension 1 of array 'myxout' below lower bound of 671
Fortran runtime error: Index '1' of dimension 1 of array 'myxout' below lower bound of 503
Fortran runtime error: Index '0' of dimension 2 of array 'myxout' below lower bound of 319
Error termination. Backtrace:
Fortran runtime error: Index '0' of dimension 2 of array 'myxout' below lower bound of 399
Error termination. Backtrace:
At line 155 of file regrid.f90#0 0x7ff967e11d1d in ???
#1 0x7ff967e12825 in ???
#2 0x7ff967e12bca in ???
#0 0x7fe6e95fcd1d in ???
Fortran runtime error: Index '0' of dimension 2 of array 'myxout' below lower bound of 399
Error termination. Backtrace:
#3 0x55f7829eb993 in regrid_
at /home/ivica/NORTH_TC/Build/regrid.f90:155
#4 0x55f7829cc6df in __nf_fread2d_mod_MOD_nf_fread2d
at /home/ivica/NORTH_TC/Build/nf_fread2d.f90:309
#5 0x55f782670b2d in get_2dfld_
at /home/ivica/NORTH_TC/Build/get_2dfld.f90:227
#6 0x55f7823095da in get_data_
at /home/ivica/NORTH_TC/Build/get_data.f90:95
#7 0x55f782230117 in initial_
at /home/ivica/NORTH_TC/Build/initial.f90:229
#8 0x55f781e11ee2 in __ocean_control_mod_MOD_roms_initialize
at /home/ivica/NORTH_TC/Build/ocean_control.f90:133
#9 0x55f781e0e43d in ocean
at /home/ivica/NORTH_TC/Build/master.f90:95
#10 0x55f781e0eab2 in main
at /home/ivica/NORTH_TC/Build/master.f90:50
and so on....
Error termination. Backtrace:
and line 155 of regrid.f90 is MyXout(i,j)=Xout(i,j):
      DO j=Jmin,Jmax
        DO i=Imin,Imax
          MyXout(i,j)=Xout(i,j)          ! range [-180 180]
        END DO
      END DO
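The out-of-bounds indices above suggest the copy ranges no longer match the bounds MyXout was declared with. A hypothetical sanity check before the loops (LBi, UBi, LBj, UBj assumed to be the declared tile bounds) would report the mismatch instead of crashing:
!  Hypothetical check: report when the copy range falls outside the
!  declared bounds of MyXout before the nested loops run.
      IF ((Imin.lt.LBi).or.(Imax.gt.UBi).or.                           &
          (Jmin.lt.LBj).or.(Jmax.gt.UBj)) THEN
        PRINT *, 'regrid: copy range outside MyXout bounds: ',         &
                 Imin, Imax, Jmin, Jmax, LBi, UBi, LBj, UBj
      END IF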
If you want, I can put the example on my server so you can grab it.
Thanks!
Ivica
- arango
- Site Admin
- Posts: 1367
- Joined: Wed Feb 26, 2003 4:41 pm
- Location: DMCS, Rutgers University
Re: Pair and MPI version has problem with regridding
It doesn't make sense to me. MyXout is a tiled state variable; it is allocated like the others, and the pointer is passed correctly. It is the only way this can be done. I bet the problem is not in regrid. It seems like memory leakage somewhere else.
Yes, you can put the application somewhere for me to access. I don't know what I can do other than compile with the strict flags in ifort and gfortran. I cannot debug with that many processors. We need to put print statements for Imin, Imax, Jmin, Jmax, LBi, UBi, LBj, and UBj to check what gets corrupted with so many processors.
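As a concrete sketch of such a diagnostic (assuming MyRank and stdout are available from the usual modules, and that the index ranges are in scope near the top of regrid), one line per rank would be enough:
!  Sketch of a diagnostic print: one line per MPI rank showing the
!  index ranges and the declared tile bounds.
      WRITE (stdout,*) 'regrid rank=', MyRank,                         &
                       ' Imin,Imax,Jmin,Jmax=', Imin, Imax, Jmin, Jmax,&
                       ' LBi,UBi,LBj,UBj=', LBi, UBi, LBj, UBj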
- jivica
- Posts: 172
- Joined: Mon May 05, 2003 2:41 pm
- Location: The University of Western Australia, Perth, Australia
Re: Pair and MPI version has problem with regridding
I've sent you the link privately by email.
Ivica
- arango
- Site Admin
- Posts: 1367
- Joined: Wed Feb 26, 2003 4:41 pm
- Location: DMCS, Rutgers University
Re: Pair and MPI version has problem with regridding
I updated the code to correct the bug in regrid.F. Check trac ticket src808 for more details. The parallel bug has been corrected. Good luck.