sudden blowup using mpi

Report or discuss software problems and other woes

Moderators: arango, robertson

asujrpv

sudden blowup using mpi

#1 Unread post by asujrpv »

Hello everybody,

I have successfully run a ROMS simulation using MPI, but only on a small grid (30 x 60 nodes). I have now increased the grid to 160 x 160 and reduced the time step accordingly, which keeps the Courant number at about 0.1. The fine grid runs without problems in serial, but when I run it in parallel with MPI, I get the following error in the middle of the run. Has anyone experienced something similar? Your help is much appreciated.

360 0 09:00:00 1.197521E-05 2.230088E+02 2.230088E+02 2.095579E+11
(022,001,05) 3.050260E-03 1.133388E-05 4.942943E-02 4.500022E-02
WRT_HIS - wrote history fields (Index=1,1) into time record = 0000010
WRT_AVG - wrote averaged fields into time record = 0000009
WRT_DIAGS - wrote diagnostics fields into time record = 0000009
WRT_RST - wrote re-start fields (Index=1,1) into time record = 0000001
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
oceanM 000000000079B03A Unknown Unknown Unknown
oceanM 000000000077A699 Unknown Unknown Unknown
oceanM 000000000076741C Unknown Unknown Unknown
oceanM 000000000075B7B0 Unknown Unknown Unknown
oceanM 000000000046F374 Unknown Unknown Unknown

kate
Posts: 4091
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: sudden blowup using mpi

#2 Unread post by kate »

A seg fault can have several causes. Have you tried running with array bounds checking turned on?

asujrpv

Re: sudden blowup using mpi

#3 Unread post by asujrpv »

kate wrote: A seg fault can have several causes. Have you tried running with array bounds checking turned on?
Kate,
I need more guidance here. Would you be kind enough to let me know where I need to check this?
Thanks,
Rafael

kate

Re: sudden blowup using mpi

#4 Unread post by kate »

There is a compile-time option, sometimes "-C", for turning on array-bounds checking. It might get turned on with the USE_DEBUG option. It causes extra overhead, so don't use it for production runs, but it's perfect for unknown problems like this.
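
To confirm that a given build really has bounds checking enabled, one option is to inspect the flags passed to the Fortran compiler. The following is a minimal Python sketch, not part of ROMS: the function name `has_bounds_checking` is hypothetical, and it only recognizes the common Intel ifort spellings (`-C`, `-CB`, `-check bounds`, `-check all`) and gfortran spellings (`-fbounds-check`, `-fcheck=bounds`, `-fcheck=all`). The flag strings in the usage lines are made-up examples, not taken from any ROMS makefile.

```python
import shlex

# Single-token flags that turn on array-bounds checking.
# ifort: -C (all run-time checks), -CB (bounds only); gfortran: -fbounds-check.
SINGLE_TOKEN_FLAGS = {"-C", "-CB", "-fbounds-check"}


def has_bounds_checking(fflags: str) -> bool:
    """Return True if the flag string appears to enable array-bounds checking."""
    tokens = shlex.split(fflags)
    for i, tok in enumerate(tokens):
        if tok in SINGLE_TOKEN_FLAGS:
            return True
        # gfortran style: -fcheck=bounds, -fcheck=all, -fcheck=bounds,do
        if tok.startswith("-fcheck="):
            kinds = tok.split("=", 1)[1].split(",")
            if "bounds" in kinds or "all" in kinds:
                return True
        # ifort style: "-check bounds" or "-check all" (keyword is the next token)
        if tok == "-check" and i + 1 < len(tokens):
            kinds = tokens[i + 1].split(",")
            if "bounds" in kinds or "all" in kinds:
                return True
    return False


# Hypothetical flag strings, for illustration only.
print(has_bounds_checking("-g -check bounds -traceback"))
print(has_bounds_checking("-O3 -ip"))
```

The sketch does not handle negated keywords such as `-check nobounds` or a later flag overriding an earlier one, so treat a positive result as a hint and check the actual compile lines in the build log.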

asujrpv

Re: sudden blowup using mpi

#5 Unread post by asujrpv »

Kate, the error I get after building with the debug option is below. What can I do?


1440 1 12:00:00 4.359632E-06 2.230086E+02 2.230087E+02 2.095579E+11
(008,017,05) 1.520756E-03 3.943592E-04 1.415241E-02 2.390050E-02
WRT_HIS - wrote history fields (Index=1,1) into time record = 0000037
WRT_AVG - wrote averaged fields into time record = 0000036
WRT_DIAGS - wrote diagnostics fields into time record = 0000036
WRT_RST - wrote re-start fields (Index=1,1) into time record = 0000002
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
oceanG 0000000000956E76 Unknown Unknown Unknown
oceanG 0000000000956F93 Unknown Unknown Unknown
oceanG 0000000000948BDD Unknown Unknown Unknown
oceanG 000000000093D21F Unknown Unknown Unknown
oceanG 0000000000496DB1 distribute_mod_mp 2821 distribute.f90
oceanG 000000000063007E obc_volcons_mod_m 229 obc_volcons.f90
oceanG 000000000056F157 step2d_mod_mp_ste 1204 step2d.f90
oceanG 0000000000538DBF step2d_mod_mp_ste 71 step2d.f90
oceanG 00000000004B7459 main3d_ 247 main3d.f90
oceanG 0000000000406639 ocean_control_mod 153 ocean_control.f90
oceanG 000000000040606C MAIN__ 108 master.f90
oceanG 0000000000405D8C Unknown Unknown Unknown
libc.so.6 000000388081D994 Unknown Unknown Unknown
oceanG 0000000000405C99 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libpthread.so.0 000000388100B725 Unknown Unknown Unknown
libmthca-rdmav2.s 00002AAAAB162B4F Unknown Unknown Unknown
oceanG 00000000009648CB Unknown Unknown Unknown
oceanG 000000000096D4AB Unknown Unknown Unknown
oceanG 000000000093EE0A Unknown Unknown Unknown
oceanG 0000000000982932 Unknown Unknown Unknown
oceanG 0000000000957451 Unknown Unknown Unknown
oceanG 0000000000956F93 Unknown Unknown Unknown
oceanG 0000000000948BDD Unknown Unknown Unknown
oceanG 000000000093D21F Unknown Unknown Unknown
oceanG 0000000000496DB1 distribute_mod_mp 2821 distribute.f90
oceanG 000000000063007E obc_volcons_mod_m 229 obc_volcons.f90
oceanG 000000000056F157 step2d_mod_mp_ste 1204 step2d.f90
oceanG 0000000000538DBF step2d_mod_mp_ste 71 step2d.f90
oceanG 00000000004B7459 main3d_ 247 main3d.f90
oceanG 0000000000406639 ocean_control_mod 153 ocean_control.f90
oceanG 000000000040606C MAIN__ 108 master.f90
oceanG 0000000000405D8C Unknown Unknown Unknown
libc.so.6 000000388081D994 Unknown Unknown Unknown
oceanG 0000000000405C99 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libpthread.so.0 000000388100B720 Unknown Unknown Unknown
libmthca-rdmav2.s 00002AAAAB162B4F Unknown Unknown Unknown
oceanG 00000000009648CB Unknown Unknown Unknown
oceanG 000000000096D4AB Unknown Unknown Unknown
oceanG 000000000093EE0A Unknown Unknown Unknown
oceanG 0000000000982932 Unknown Unknown Unknown
oceanG 0000000000957451 Unknown Unknown Unknown
oceanG 0000000000956F93 Unknown Unknown Unknown
oceanG 0000000000948BDD Unknown Unknown Unknown
oceanG 000000000093D21F Unknown Unknown Unknown
oceanG 0000000000496DB1 distribute_mod_mp 2821 distribute.f90
oceanG 000000000063007E obc_volcons_mod_m 229 obc_volcons.f90
oceanG 000000000056F157 step2d_mod_mp_ste 1204 step2d.f90
oceanG 0000000000538DBF step2d_mod_mp_ste 71 step2d.f90
oceanG 00000000004B7459 main3d_ 247 main3d.f90
oceanG 0000000000406639 ocean_control_mod 153 ocean_control.f90
oceanG 000000000040606C MAIN__ 108 master.f90
oceanG 0000000000405D8C Unknown Unknown Unknown
libc.so.6 000000388081D994 Unknown Unknown Unknown
oceanG 0000000000405C99 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
oceanG 0000000000978940 Unknown Unknown Unknown
oceanG 00000000009643E1 Unknown Unknown Unknown
oceanG 000000000096D4AB Unknown Unknown Unknown
oceanG 000000000093EE0A Unknown Unknown Unknown
oceanG 0000000000982932 Unknown Unknown Unknown
oceanG 0000000000957451 Unknown Unknown Unknown
oceanG 0000000000956F93 Unknown Unknown Unknown
oceanG 0000000000948BDD Unknown Unknown Unknown
oceanG 000000000093D21F Unknown Unknown Unknown

kate

Re: sudden blowup using mpi

#6 Unread post by kate »

Did compiling with USE_DEBUG cause it to use array bounds checking? If so, the cause of this seg fault is something else. Do you have a parallel debugger you can try? If not, then I would try print statements. What operation is it dying on? What happens right after the call to WRT_RST? Is it running out of memory somehow? Can you try a different computer and/or compiler?
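
On the question of which operation it is dying on, the debug traceback in post #5 gives a partial answer: the innermost frame with source information is line 2821 of distribute.f90, reached from line 229 of obc_volcons.f90 inside step2d. When several MPI processes dump their tracebacks into the same log, picking out those resolved frames by hand gets tedious. The following is a minimal Python sketch that does it, assuming the five whitespace-separated columns (Image, PC, Routine, Line, Source) shown in the post. The function name `resolved_frames` is hypothetical, and the sample text is copied from the traceback above.

```python
import re

# One traceback frame: image, hex program counter, routine, line number, source file.
# Frames reported as "Unknown Unknown Unknown" fail the numeric line match and are skipped.
FRAME_RE = re.compile(
    r"^(?P<image>\S+)\s+(?P<pc>[0-9A-Fa-f]{8,16})\s+"
    r"(?P<routine>\S+)\s+(?P<line>\d+)\s+(?P<source>\S+)$"
)


def resolved_frames(traceback_text):
    """Return (routine, line, source) for every frame that has source information."""
    frames = []
    for raw in traceback_text.splitlines():
        m = FRAME_RE.match(raw.strip())
        if m:
            frames.append((m["routine"], int(m["line"]), m["source"]))
    return frames


sample = """\
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
oceanG 0000000000956E76 Unknown Unknown Unknown
oceanG 000000000093D21F Unknown Unknown Unknown
oceanG 0000000000496DB1 distribute_mod_mp 2821 distribute.f90
oceanG 000000000063007E obc_volcons_mod_m 229 obc_volcons.f90
oceanG 000000000056F157 step2d_mod_mp_ste 1204 step2d.f90
"""

frames = resolved_frames(sample)
for routine, line, source in frames:
    print(f"{source}:{line} ({routine})")
```

The first tuple returned is the deepest frame the compiler could resolve, which is a reasonable place to start adding print statements or setting a breakpoint.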
