Hi,
Hope you can tell me the trick here. I'm trying to run ROMS on a 10-node Linux x86_64 cluster (MVAPICH + OpenPBS).
Each node has two processors. Am I right that I should be able to run the code with more than 10 parallel processes?
For example, the following run is successful:
//----
mpirun -np 8 ./oceanM ocean_upw.in
//----
But the following failed (I did set NtileJ=16):
//-------
mpirun -np 16 ./oceanM ocean_upw.in
//------
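The tiling in ocean_upw.in is set so that NtileI * NtileJ matches the number of MPI processes (1 x 16 = 16), i.e.:
//----
NtileI == 1    ! I-direction partition
NtileJ == 16   ! J-direction partition
//----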
The error reported on the screen is:
Node # 0 (pid= 22521) is active.
Node # 7 (pid= 26465) is active.
Node # 8 (pid= 11172) is active.
Node # 13 (pid= 19938) is active.
Node # 6 (pid= 17286) is active.
Node # 3 (pid= 20481) is active.
Node # 14 (pid= 17150) is active.
Node # 11 (pid= 20845) is active.
Node # 9 (pid= 22522) is active.
Node # 5 (pid= 17149) is active.
Node # 4 (pid= 19937) is active.
Node # 2 (pid= 20844) is active.
Node # 10 (pid= 22337) is active.
Node # 12 (pid= 20482) is active.
Model Input Parameters: ROMS/TOMS version 2.2
Monday - April 17, 2006 - 2:58:25 PM
-----------------------------------------------------------------------------
Node # 1 (pid= 22338) is active.
Node # 15 (pid= 17285) is active.
ROMS/TOMS 2.2 - Wind-Driven Upwelling/Downwelling over a Periodic Channel
Operating system : Linux
CPU/hardware : x86_64
Compiler system : mpif90
Compiler command : mpif90
Compiler flags : -O3
Input Script: ocean_upw.in
Resolution, Grid 01: 0041x0080x016, Parallel Nodes: 16, Tiling: 001x016
.....
.....
13 - MPI_IRECV : Invalid count argument is -138
[13] [] Aborting Program!
15 - MPI_IRECV : Invalid count argument is -198
[15] [] Aborting Program!
10 - MPI_IRECV : Invalid rank 85796672
[10] [] Aborting Program!
12 - MPI_IRECV : Invalid rank 87370496
[12] [] Aborting Program!
14 - MPI_IRECV : Invalid rank 88944320
[14] [] Aborting Program!
[9] Abort: [node2:9] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81
at line 1804 in file viacheck.c
11 - MPI_IRECV : Invalid count argument is -78
-----------------
What should I do about this? Or should I not have tried to run more than 10 processes on a 10-node machine, even though each node is a multi-processor computer?
Thanks
Wen
How to run ROMS with MPI + OpenPBS on all processors?
Hi-
I do not recognize your particular error message. However, it smells like a problem in the configuration of your MPI system. You should certainly be able to run one part of your job on each processor.
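One common stumbling block with dual-processor nodes: the machinefile usually has to list each host once per process slot, so a host appears twice if it has two processors. A sketch, with made-up host names:

node1
node1
node2
node2

Under OpenPBS you would request the equivalent with something like "nodes=10:ppn=2" in your job script, though the exact syntax depends on your installation.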
However, if you are having trouble running the code, you should first get MPI working on simple programs before getting deeply into ROMS, which has its own complexities. Attached is a short C program that computes pi on a multiprocessor system (cpi.c) and a short script that I use to debug new MPI settings.
In particular, you should make sure your PATH and LD_LIBRARY_PATH are correct on all of your remote machines when you run your program through MPI. The shell script will help you check this by printing these variables from a script it submits with mpirun, or its local equivalent. The shell script then tries to compile and run cpi.c. If your MPI setup is working as you think, cpi should run and print out pi, along with a message from each parallel part of the code.
Put the attached C and shell scripts into their own directory, edit the shell script to match your particular configuration, and run it to see what happens. Note that the "vstat" command checks the status of MPI on my machine, but you may want something else on your machine.
Cheers,
Jamie
test_mpi.com, the test code:
#!/bin/csh
# this code assumes that the file "cpi.c" exists and is in the
# directory it is being run in. cpi.c just computes an approximation
# of pi on multiple processors, and takes no time to run.
# this code is designed to debug an MPI problem. Do not just read it,
# but look at it and see what it should do, and adjust to your
# circumstances. You must define the following variables and files
#
# 1) in the variable MPIRUN, the commands to run an MPI job
# 2) in the variable MPICOMP, the command to compile and link a C program
# 3) in the file "nodefile", the names of the nodes to run on. Depending
# on your MPI implementation, you may have to list nodes twice if they
# have two processors.
# 4) in the variable NTEST, the number of processors to use
#
# It will run two pieces of code using MPI. The first is a simple
# shell script which prints out the path, the environment, and any
# other command you want on each node. This is useful for seeing if
# things are as they should be.
#
# Good luck, J. Pringle, University of New Hampshire
# FIRST, DEFINE VARIABLES AND FILES NEEDED FOR A PARALLEL JOB.
setenv NTEST 4
setenv MPICOMP mpicc
setenv MPIRUN "mpirun_rsh -rsh -np $NTEST -hostfile nodefile "
# DEFINE NODEFILE FOR RUN
cat > nodefile << EOF
node1
node2
node3
node4
EOF
# (node1..node4 above are placeholders -- list your own hosts,
# one entry per process)
# NOW MAKE THE TEST SCRIPT WHICH PRINTS PATHS AND STATUS ON EACH NODE
cat > test_script.csh << 'EOF'
#!/bin/csh
foreach line (`echo $PATH` `echo $LD_LIBRARY_PATH` `vstat`)
echo `hostname` ' : ' $line
end
EOF
echo " "
echo " "
chmod +x test_script.csh
$MPIRUN ./test_script.csh
/bin/rm -rf test_script.csh
echo " "
echo " "
#NOW COMPILE AND RUN TEST C CODE IN PARALLEL MODE
$MPICOMP -o cpi cpi.c
$MPIRUN ./cpi
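If everything is set up correctly, the first part should print the PATH and LD_LIBRARY_PATH entries, one element per line for each node, and the second part should end with output along these lines (digits and host names will differ):

Process 0 of 4 on node1
Process 1 of 4 on node2
pi is approximately 3.14159..., Error is 0.00000...
wall clock time = ...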
And the cpi.c code is:
#include "mpi.h"
#include <stdio>
#include <math>
double f(double);
double f(double a)
{
return (4.0 / (1.0 + a*a));
}
int main(int argc,char *argv[])
{
int done = 0, n, myid, numprocs, i;
double PI25DT = 3.141592653589793238462643;
double mypi, pi, h, sum, x;
double startwtime = 0.0, endwtime;
int namelen;
char processor_name[MPI_MAX_PROCESSOR_NAME];
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
MPI_Get_processor_name(processor_name,&namelen);
fprintf(stdout,"Process %d of %d on %s\n",
myid, numprocs, processor_name);
n = 0;
while (!done)
{
if (myid == 0)
{
/*
printf("Enter the number of intervals: (0 quits) ");
scanf("%d",&n);
*/
if (n==0) n=10000; else n=0;
startwtime = MPI_Wtime();
}
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
if (n == 0)
done = 1;
else
{
h = 1.0 / (double) n;
sum = 0.0;
/* A slightly better approach starts from large i and works back */
for (i = myid + 1; i <= n; i += numprocs)
{
x = h * ((double)i - 0.5);
sum += f(x);
}
mypi = h * sum;
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if (myid == 0)
{
printf("pi is approximately %.16f, Error is %.16f\n",
pi, fabs(pi - PI25DT));
endwtime = MPI_Wtime();
printf("wall clock time = %f\n", endwtime-startwtime);
fflush( stdout );
}
}
}
MPI_Finalize();
return 0;
}
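To try cpi by hand, outside the script, the equivalent of the script's MPICOMP and MPIRUN settings would be (assuming MVAPICH's mpicc and mpirun_rsh are on your PATH, and that nodefile exists in the current directory):

mpicc -o cpi cpi.c
mpirun_rsh -rsh -np 4 -hostfile nodefile ./cpi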