Dear all,
Something goes wrong when I run ROMS. The parallel settings are NtileI == 1; NtileJ == 1. Here are the clues:
rank 0 in job 1 c0202_35389 caused collective abort of all ranks.
exit status of rank 0: killed by signal 9
I don't know what this means. Can someone help me? Any information would be helpful.
Thank you
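To read the second line of that message: signal 9 is SIGKILL, which a process cannot catch or ignore. It is sent from outside the program, commonly by the Linux out-of-memory killer or by a batch scheduler enforcing a resource limit, which is why ROMS itself prints no error of its own. The small Python sketch below (not part of ROMS; the helper name `decode_exit_line` is hypothetical) pulls the rank and signal number out of the line quoted above and maps the number to its symbolic name using the standard `signal` module.

```python
import re
import signal

# Matches the "exit status" line quoted in this thread; the exact wording
# is taken from that log and may differ between MPI implementations.
EXIT_LINE = re.compile(r"exit status of rank (\d+): killed by signal (\d+)")


def decode_exit_line(line):
    """Return (rank, signal number, signal name), or None if the line doesn't match."""
    match = EXIT_LINE.search(line)
    if match is None:
        return None
    rank, signum = int(match.group(1)), int(match.group(2))
    try:
        name = signal.Signals(signum).name  # e.g. 9 -> "SIGKILL" on Linux
    except ValueError:
        name = f"unknown signal {signum}"
    return rank, signum, name


print(decode_exit_line("exit status of rank 0: killed by signal 9"))
```

On Linux this prints `(0, 9, 'SIGKILL')`. Because the kill comes from outside ROMS, the kernel log (for example `dmesg`) or the scheduler's job report is usually where the actual reason shows up.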
Problems on run
Re: Problems on run
Did you compile it for parallel processing? Did you submit it as a parallel job with mpirun? How many tasks? At what stage of the initialization did it die?
Re: Problems on run
Dear Kate,
Thank you for your reply.
Yes, I compiled it for parallel processing and ran it with mpirun. At first the tiling was 2*8, but it did not run. Finally I set it to 1*1, but the same problem occurred. The build (make) completes without errors. The initialization finishes, but then the run dies without doing any computation. The messages are as follows:
rank 0 in job 1 c0202_35389 caused collective abort of all ranks.
exit status of rank 0: killed by signal 9
I don't know what is wrong or where the problem is.
Re: Problems on run
What happens if you recompile in serial mode? Does that work?
In the above, does ROMS print out anything at all in the initialization?
Re: Problems on run
Dear Kate,
Thank you.
In serial mode, ROMS runs. So why doesn't it run in MPI mode?
Re: Problems on run
I got this error message:

** rank 0 in job 1 c0202_35389 caused collective abort of all ranks

In my case it was caused by computer memory issues. I increased the number of processors and it ran, but I'm not sure this message is specific to that problem. Still, you could try that. Could you also post the full log of your experiment?
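One way to see why more processors can help with a memory problem: in a tiled MPI run each rank allocates only its own tile (plus a halo of ghost points), so per-process memory falls roughly with NtileI × NtileJ, while a 1 × 1 tiling puts the whole grid in a single process. The back-of-the-envelope sketch below is hypothetical throughout: the grid size (500 × 400 × 30), the count of 100 double-precision 3D arrays, and the 2-point halo are placeholders, not values from this thread or from ROMS itself.

```python
BYTES_PER_DOUBLE = 8


def per_rank_megabytes(Lm, Mm, N, n3d_arrays, ntile_i, ntile_j, ghost=2):
    """Rough per-rank memory (MB) for n3d_arrays double-precision 3D fields
    when an Lm x Mm x N grid is split horizontally into ntile_i x ntile_j tiles.
    Each tile is padded by `ghost` halo points on every side (assumed)."""
    nx = -(-Lm // ntile_i) + 2 * ghost  # ceiling division, plus halo
    ny = -(-Mm // ntile_j) + 2 * ghost
    return n3d_arrays * nx * ny * N * BYTES_PER_DOUBLE / 1e6


# Hypothetical grid and array count, compared for the two tilings tried above.
for ti, tj in [(1, 1), (2, 8)]:
    mb = per_rank_megabytes(500, 400, 30, 100, ti, tj)
    print(f"{ti}x{tj}: {mb:.0f} MB per rank")
```

With these placeholder numbers it prints `1x1: 4887 MB per rank` and `2x8: 329 MB per rank`. Whether the real job fits also depends on how many ranks share each node's memory, so the scheduler's per-node limits matter as much as the tile count.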