Distributed Memory
Hi,
I'm trying to run ROMS/Ecosim using 8 processors.
1) The executable was compiled with MPI turned on
2) Tiling in input set to:
NtileI == 2
NtileJ == 4
3) command line:
mpirun -np 8 oceanMlatteecosim External/ocean_latte_2005_Apr_bio.in > & loglatteecosim2
It gets as far as the first line of the time stepping and then gives the errors below. Does anyone know what this means?
STEP time[DAYS] KINETIC_ENRG POTEN_ENRG TOTAL_ENRG NET_VOLUME trd
103200 107.500000 2.914093E+00 2.047989E+02 2.077130E+02 1.364563E+12 0
p2_1008: p4_error: interrupt SIGSEGV: 11
0: DEALLOCATE: memory at 0xef5720 not allocated
0: DEALLOCATE: memory at 0xf1b940 not allocated
0: DEALLOCATE: memory at 0xf11f90 not allocated
0: DEALLOCATE: memory at 0xf9d1e0 not allocated
p4_1012: p4_error: interrupt SIGSEGV: 11
p1_1006: p4_error: interrupt SIGSEGV: 11
p5_1014: p4_error: interrupt SIGSEGV: 11
rm_l_6_1018: (155.343750) net_send: could not write to fd=6, errno = 9
p4_error: latest msg from perror: Bad file descriptor
rm_l_6_1018: p4_error: net_send write: -1
rm_l_7_1020: (155.300781) net_send: could not write to fd=6, errno = 9
p4_error: latest msg from perror: Bad file descriptor
rm_l_7_1020: p4_error: net_send write: -1
OK, good so far.
Next step would be to try 2 tiles. Let's go with
NtileI == 1
NtileJ == 2
As a hunch, it may be an issue in the "J" direction, because you had 4 of these errors:
0: DEALLOCATE: memory at 0xef5720 not allocated
0: DEALLOCATE: memory at 0xf1b940 not allocated
0: DEALLOCATE: memory at 0xf11f90 not allocated
0: DEALLOCATE: memory at 0xf9d1e0 not allocated
and NtileJ was set to 4.
Just a hunch; let's see if I am close.
OK, this is what happens with 2 tiles. The error message comes after the first step:
STEP time[DAYS] KINETIC_ENRG POTEN_ENRG TOTAL_ENRG NET_VOLUME trd
103200 107.500000 2.914093E+00 2.047989E+02 2.077130E+02 1.364563E+12 0
p1_16392: p4_error: interrupt SIGSEGV: 11
DEF_HIS - creating history file: latte_out/ecosim/his_latte_003_2005_0108.nc
Not sure what TotalView is. Is it for debugging?
And yes, the last run was
NtileI == 1
NtileJ == 2
Just tried it the other way around
NtileI == 2
NtileJ == 1
and this happens
STEP time[DAYS] KINETIC_ENRG POTEN_ENRG TOTAL_ENRG NET_VOLUME trd
103200 107.500000 2.914093E+00 2.047989E+02 2.077130E+02 1.364563E+12 0
0: DEALLOCATE: memory at 0xadab30 not allocated
p1_20887: p4_error: interrupt SIGSEGV: 11
Without a debugger you will have to try things the 'old-fashioned way', which is tried and proven to work.
First compile the program with -g (or whatever the debug flag is for your compiler) and with the bounds-checking flag. Run the model again. If the errors do not shed more light on the situation, then I recommend that you put write statements in, such as:
write(*,*) 'line xx of ecosim'
Recompile and run that.
The debug build may produce an executable named oceanG (not oceanM).
The write statements will help determine where in the code the error occurs.
You can usually locate the error in a few tries.
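One caveat with the write-statement approach: output is often buffered, so the last few messages may never reach the log if a process dies with a SIGSEGV. A minimal sketch of a tracing helper that flushes after every message is shown below. The names trace_demo and trace, the checkpoint labels, and the hard-coded rank are all hypothetical and not part of ROMS; in the real model you would pass in whatever variable holds the MPI rank.

```fortran
program trace_demo
  use, intrinsic :: iso_fortran_env, only: output_unit
  implicit none
  integer :: rank

  ! Assumption: stands in for the MPI rank of the calling process.
  rank = 0

  ! Hypothetical checkpoints; in practice, place calls like these
  ! between the suspect blocks of the ecosim routine.
  call trace(rank, 'ecosim: checkpoint 1')
  call trace(rank, 'ecosim: checkpoint 2')

contains

  subroutine trace(rank, label)
    integer, intent(in) :: rank
    character(len=*), intent(in) :: label

    ! Tag each message with the rank so output from several tiles
    ! can be told apart in a shared log file.
    write(output_unit, '(a,i0,2a)') 'rank ', rank, ': ', label

    ! Force the message out now, so it is not lost in a buffer
    ! if the process crashes right after this call.
    flush(output_unit)
  end subroutine trace

end program trace_demo
```

The last checkpoint that appears in the log for a given rank brackets the crash: the fault lies somewhere between that call and the next one, so moving the calls closer together narrows it down on each rerun.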
First compile the program with -g (or what ever the debug flag is for your system) and with the flag for check bounds. Run the model again. If the errors do not shed more light on the situation then I recommend that you put write statements in, such as:
write(*,*) 'line xx of ecosim'
recompile and run that.
It may recompile as oceanG (not oceanM).
this will help determine where in the code the error occurs.
You can usually locate the error in a few tries.