I tried to run ROMS with mvapich on 64 cores for 100 iterations, but it fails with a SIGSEGV (signal 11) fault.
Could you please advise what I should do?
ROMS not running with mvapich
Re: ROMS not running with mvapich
This is when I get out a debugger or some print statements to find out where in the code you are getting into trouble. You don't provide enough information (nor is it always easy to get).
Re: ROMS not running with mvapich
The following is the error I get when running ROMS on 64 cores for 100 iterations using mvapich:
Resource usage summary:
CPU time : 1.44 sec.
Max Memory : 5 MB
Max Swap : 36 MB
The output (if any) follows:
Process Information:
Node # 19 (pid= 8769) is active.
Node # 27 (pid= 8819) is active.
Node # 51 (pid= 9920) is active.
Node # 59 (pid= 12682) is active.
Node # 62 (pid= 12919) is active.
Node # 30 (pid= 9038) is active.
Node # 22 (pid= 8988) is active.
Node # 54 (pid= 10139) is active.
Node # 26 (pid= 8746) is active.
Node # 58 (pid= 12609) is active.
Node # 23 (pid= 9061) is active.
Node # 7 (pid= 11308) is active.
Node # 3 (pid= 10536) is active.
Node # 55 (pid= 10212) is active.
Node # 31 (pid= 9111) is active.
Node # 63 (pid= 12994) is active.
Node # 35 (pid= 9915) is active.
Node # 39 (pid= 10207) is active.
Node # 47 (pid= 8892) is active.
Node # 43 (pid= 8600) is active.
Node # 15 (pid= 10214) is active.
Node # 11 (pid= 9908) is active.
Node # 2 (pid= 10338) is active.
Node # 18 (pid= 8696) is active.
Node # 50 (pid= 9847) is active.
Node # 6 (pid= 11113) is active.
Node # 34 (pid= 9842) is active.
Node # 38 (pid= 10134) is active.
Node # 10 (pid= 9832) is active.
Node # 42 (pid= 8527) is active.
Node # 0 (pid= 9848) is active.
Node # 40 (pid= 8381) is active.
Node # 8 (pid= 9570) is active.
Node # 32 (pid= 9696) is active.
Node # 24 (pid= 8600) is active.
Node # 56 (pid= 12170) is active.
Node # 12 (pid= 9987) is active.
Node # 28 (pid= 8892) is active.
Node # 44 (pid= 8673) is active.
Node # 60 (pid= 12761) is active.
Node # 14 (pid= 10141) is active.
Node # 46 (pid= 8819) is active.
Node # 20 (pid= 8842) is active.
Node # 52 (pid= 9993) is active.
Node # 4 (pid= 10732) is active.
Node # 36 (pid= 9988) is active.
Node # 5 (pid= 10924) is active.
Node # 13 (pid= 10068) is active.
Node # 45 (pid= 8746) is active.
Node # 21 (pid= 8915) is active.
Node # 53 (pid= 10066) is active.
Node # 37 (pid= 10061) is active.
Node # 29 (pid= 8965) is active.
Node # 61 (pid= 12840) is active.
Model Input Parameters: ROMS/TOMS version 3.2
Monday - November 15, 2010 - 5:17:58 PM
-----------------------------------------------------------------------------
INP_PAR - Unable to open ROMS/TOMS input script file.
In distributed-memory applications, the input
script file is processed in parallel. The Unix
routine GETARG is used to get script file name.
For example, in MPI applications make sure that
command line is something like:
mpirun -np 4 ocean ocean.in
and not
mpirun -np 4 ocean < ocean.in
Elapsed CPU time (seconds):
Node # 48 (pid= 9701) is active.
Node # 16 (pid= 8550) is active.
Node # 1 (pid= 10142) is active.
Node # 33 (pid= 9769) is active.
Node # 17 (pid= 8623) is active.
Node # 49 (pid= 9774) is active.
Node # 9 (pid= 9756) is active.
Node # 41 (pid= 8454) is active.
Node # 57 (pid= 12530) is active.
Node # 25 (pid= 8673) is active.
Node # 2 CPU: 0.001
Node # 5 CPU: 0.006
Node # 0 CPU: 0.003
Node # 4 CPU: 0.000
Node # 18 CPU: 0.087
Node # 28 CPU: 0.194
Node # 27 CPU: 0.027
Node # 59 CPU: 0.018
Node # 21 CPU: 0.016
Node # 56 CPU: 0.015
Node # 26 CPU: 0.089
Node # 58 CPU: 0.002
Node # 29 CPU: 0.223
Node # 50 CPU: 0.201
Node # 19 CPU: 0.002
Node # 31 CPU: 0.003
Node # 53 CPU: 0.207
Node # 20 CPU: 0.103
Node # 7 CPU: 0.006
Node # 60 CPU: 0.106
Node # 6 CPU: 0.006
Node # 54 CPU: 0.102
Node # 63 CPU: 0.019
Node # 22 CPU: 0.007
Node # 55 CPU: 0.010
Node # 30 CPU: 0.005
Node # 24 CPU: 0.106
Node # 62 CPU: 0.006
Node # 51 CPU: 0.015
Node # 52 CPU: 0.105
Node # 61 CPU: 0.225
Node # 23 CPU: 0.111
Node # 3 CPU: 0.028
p61_12840: p4_error: interrupt SIGSEGV: 11
p59_12682: p4_error: interrupt SIGSEGV: 11
p56_12170: p4_error: interrupt SIGSEGV: 11
p58_12609: p4_error: interrupt SIGSEGV: 11
p60_12761: p4_error: interrupt SIGSEGV: 11
p28_8892: p4_error: interrupt SIGSEGV: 11
p26_8746: p4_error: interrupt SIGSEGV: 11
p29_8965: p4_error: interrupt SIGSEGV: 11
p31_9111: p4_error: interrupt SIGSEGV: 11
p63_12994: p4_error: interrupt SIGSEGV: 11
p24_8600: p4_error: interrupt SIGSEGV: 11
p23_9061: p4_error: interrupt SIGSEGV: 11
p62_12919: p4_error: interrupt SIGSEGV: 11
p22_8988: p4_error: interrupt SIGSEGV: 11
p20_8842: p4_error: interrupt SIGSEGV: 11
p4_10732: p4_error: net_recv read: probable EOF on socket: 1
p55_10212: p4_error: interrupt SIGSEGV: 11
p52_9993: p4_error: interrupt SIGSEGV: 11
p51_9920: p4_error: interrupt SIGSEGV: 11
p54_10139: p4_error: interrupt SIGSEGV: 11
p27_8819: p4_error: interrupt SIGSEGV: 11
p18_8696: p4_error: interrupt SIGSEGV: 11
p19_8769: p4_error: interrupt SIGSEGV: 11
p30_9038: p4_error: interrupt SIGSEGV: 11
p5_10924: p4_error: net_recv read: probable EOF on socket: 1
p21_8915: p4_error: interrupt SIGSEGV: 11
p50_9847: p4_error: interrupt SIGSEGV: 11
p53_10066: p4_error: interrupt SIGSEGV: 11
Node # 40 CPU: 0.203
Node # 47 CPU: 0.110
Node # 43 CPU: 0.114
Node # 45 CPU: 0.021
Node # 44 CPU: 0.203
Node # 42 CPU: 0.200
Node # 46 CPU: 0.207
Node # 35 CPU: 0.108
Node # 37 CPU: 0.012
Node # 34 CPU: 0.104
Node # 36 CPU: 0.104
Node # 39 CPU: 0.104
Node # 32 CPU: 0.001
Node # 38 CPU: 0.098
p40_8381: p4_error: interrupt SIGSEGV: 11
p35_9915: p4_error: interrupt SIGSEGV: 11
p34_9842: p4_error: interrupt SIGSEGV: 11
p36_9988: p4_error: interrupt SIGSEGV: 11
p37_10061: p4_error: interrupt SIGSEGV: 11
p39_10207: p4_error: interrupt SIGSEGV: 11
p43_8600: p4_error: interrupt SIGSEGV: 11
p45_8746: p4_error: interrupt SIGSEGV: 11
p44_8673: p4_error: interrupt SIGSEGV: 11
p42_8527: p4_error: interrupt SIGSEGV: 11
p38_10134: p4_error: interrupt SIGSEGV: 11
p32_9696: p4_error: interrupt SIGSEGV: 11
p46_8819: p4_error: interrupt SIGSEGV: 11
p47_8892: p4_error: interrupt SIGSEGV: 11
Node # 16 CPU: 0.004
p16_8550: p4_error: interrupt SIGSEGV: 11
Node # 48 CPU: 0.200
p48_9701: p4_error: interrupt SIGSEGV: 11
Node # 17 CPU: 0.006
p17_8623: p4_error: interrupt SIGSEGV: 11
Node # 49 CPU: 0.211
p49_9774: p4_error: interrupt SIGSEGV: 11
Node # 33 CPU: 0.110
p33_9769: p4_error: interrupt SIGSEGV: 11
Node # 41 CPU: 0.310
p41_8454: p4_error: interrupt SIGSEGV: 11
Node # 1 CPU: 0.000
Node # 57 CPU: 0.319
p57_12530: p4_error: interrupt SIGSEGV: 11
Node # 25 CPU: 0.109
p25_8673: p4_error: interrupt SIGSEGV: 11
PS:
Read file <benchmark4_err_file> for stderr output of this job.
Re: ROMS not running with mvapich
How did you invoke ROMS in your script? It should have the ocean.in filename as the first argument. This is described here.
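For what it's worth, here is a minimal sketch of a pre-launch check in Python, assuming the job script can call a small helper before starting the run. The function name `build_launch_command`, the executable name `./ocean`, and the default launcher `mpirun` are placeholders chosen for illustration (your mvapich installation may use a different launcher). It only verifies that the input script exists and is readable on the submitting node, then builds a command line that passes the file as an argument instead of redirecting it with `<`.

```python
import os
import shlex
import sys


def build_launch_command(nprocs, executable, input_script, launcher="mpirun"):
    """Build an MPI launch command that passes the input script as an argument.

    All names here are illustrative; adjust the launcher and executable
    to whatever your site actually uses.
    """
    if not os.path.isfile(input_script):
        raise FileNotFoundError(f"input script not found: {input_script}")
    if not os.access(input_script, os.R_OK):
        raise PermissionError(f"input script not readable: {input_script}")
    # Use an absolute path so the name does not depend on the working
    # directory the launcher happens to start the processes in.
    script_path = os.path.abspath(input_script)
    # The script is a plain argument (read via GETARG), never "< file".
    return [launcher, "-np", str(nprocs), executable, script_path]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python check_launch.py ocean_benchmark4.in
    print(shlex.join(build_launch_command(64, "./ocean", sys.argv[1])))
```

This check runs only where the job is submitted. Since the message in your log says the input script is processed in parallel, it may also be worth confirming that the same path is visible from every compute node.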
Re: ROMS not running with mvapich
I did specify ocean_benchmark4.in correctly in the script, and it runs for all sizes except 1024.