How to run ROMS with MPI + OpenPBS on every processor?

#1 · **longmtm** » Mon Apr 17, 2006 6:55 pm

Hi,

I hope you can tell me the trick here. I'm trying to run ROMS on a 10-node Linux x86_64 cluster (MVAPICH + OpenPBS).

Each node has 2 processors. Am I right that I should be able to run more than 10 processes in parallel?

For example, the following run is successful:
//----
mpirun -np 8 ./oceanM ocean_uwp.in
//----

But the following fails (I did set NtileJ=16):
//-------
mpirun -np 16 ./oceanM ocean_uwp.in
//------
The error report on the screen is:

Node # 0 (pid= 22521) is active.
Node # 7 (pid= 26465) is active.
Node # 8 (pid= 11172) is active.
Node # 13 (pid= 19938) is active.
Node # 6 (pid= 17286) is active.
Node # 3 (pid= 20481) is active.
Node # 14 (pid= 17150) is active.
Node # 11 (pid= 20845) is active.
Node # 9 (pid= 22522) is active.
Node # 5 (pid= 17149) is active.
Node # 4 (pid= 19937) is active.
Node # 2 (pid= 20844) is active.
Node # 10 (pid= 22337) is active.
Node # 12 (pid= 20482) is active.

Model Input Parameters: ROMS/TOMS version 2.2
Monday - April 17, 2006 - 2:58:25 PM
-----------------------------------------------------------------------------
Node # 1 (pid= 22338) is active.
Node # 15 (pid= 17285) is active.

ROMS/TOMS 2.2 - Wind-Driven Upwelling/Downwelling over a Periodic Channel

Operating system : Linux
CPU/hardware : x86_64
Compiler system : mpif90
Compiler command : mpif90
Compiler flags : -O3

Input Script: ocean_upw.in

Resolution, Grid 01: 0041x0080x016, Parallel Nodes: 16, Tiling: 001x016

.....
.....

13 - MPI_IRECV : Invalid count argument is -138
[13] [] Aborting Program!
15 - MPI_IRECV : Invalid count argument is -198
[15] [] Aborting Program!
10 - MPI_IRECV : Invalid rank 85796672
[10] [] Aborting Program!
12 - MPI_IRECV : Invalid rank 87370496
[12] [] Aborting Program!
14 - MPI_IRECV : Invalid rank 88944320
[14] [] Aborting Program!
[9] Abort: [node2:9] Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81
at line 1804 in file viacheck.c
11 - MPI_IRECV : Invalid count argument is -78

-----------------

What should I do about this? Or should I not be trying to run more than 10 processes on a 10-node machine, even though each node has two processors?

Thanks

Wen

#2 · **jpringle** » Mon Apr 24, 2006 6:12 pm

Hi-

I do not recognize your particular error message. However, it smells like a problem in the configuration of your MPI system. You should certainly be able to run one part of your job on each processor.

However, if you are having trouble running the code, you should try getting MPI to work on simple programs before getting deeply into ROMS, which has its own complexities. Attached is a short C program which computes pi on a multiprocessor system (cpi.c) and a short script that I use to debug new MPI settings.

In particular, you should make sure your PATH and LD_LIBRARY_PATH are correct on all of your remote machines when you run your program through MPI. The shell script will help you check this by printing these variables from a small script that it submits with mpirun, or its local equivalent. The shell script then tries to compile and run cpi.c. If your MPI setup is working as you think, cpi should run and print out pi, along with a message from each parallel part of the code.

Put the attached C and shell scripts into their own directory, edit the shell script to match your particular configuration, and run it to see what happens. Note that the "vstat" command checks the status of MPI on my machine, but you may want something else on your machine.

Do not run the shell script before reading and understanding what it is doing.

Cheers,
Jamie

test_mpi.com, the test code:

#!/bin/csh 

# this code assumes that the file "cpi.c" exists and is in the
# directory it is being run in.  cpi.c just computes an approximation
# of pi on multiple processors, and takes no time to run.

# this code is designed to debug an MPI problem.  Do not just run it,
# but read it and see what it should do, and adjust it to your
# circumstances. You must define the following variables and files
#
#    1) in the variable MPIRUN, the commands to run an MPI job
#    2) in the variable MPICOMP, the command to compile and link a C program
#    3) in the file "nodefile", the names of the nodes to run on.  Depending
#       on your MPI implementation, you may have to list nodes twice if they
#       have two processors. 
#    4) in the variable NTEST, the number of processors to use
#

# It will run two pieces of code using MPI.  The first is a simple
# shell script which prints out the path, the environment, and the
# output of any other command you want on each node.  This is useful
# for seeing if things are as they should be.  The second is cpi.c.
#
# Good luck, J. Pringle, University of New Hampshire

# FIRST, DEFINE VARIABLES AND FILES NEEDED FOR A PARALLEL JOB. 
setenv NTEST 4
setenv MPICOMP mpicc
setenv MPIRUN "mpirun_rsh -rsh -np $NTEST -hostfile nodefile "

# DEFINE NODEFILE FOR RUN
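# (node1 and node2 are placeholder names; put your own node names here,
# one per line, listing a node twice if your MPI needs that)
cat > nodefile << EOF
node1
node1
node2
node2
EOF

# DEFINE A SMALL SCRIPT THAT PRINTS THE ENVIRONMENT ON EACH NODE
cat > test_script.csh << 'EOF'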
#!/bin/csh 
foreach line (`echo $PATH` `echo $LD_LIBRARY_PATH` `vstat`)
        echo `hostname` ' : ' $line
end
'EOF'

echo " "
echo " "
chmod +x test_script.csh
$MPIRUN ./test_script.csh
/bin/rm -rf test_script.csh
echo " "
echo " "

# NOW COMPILE AND RUN TEST C CODE IN PARALLEL MODE
$MPICOMP -o cpi cpi.c
$MPIRUN ./cpi

And the cpi.c code is:

#include "mpi.h"
#include <stdio.h>
#include <math.h>

double f(double);

double f(double a)
{
    return (4.0 / (1.0 + a*a));
}

int main(int argc,char *argv[])
{
    int done = 0, n, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    double startwtime = 0.0, endwtime;
    int namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);
    MPI_Get_processor_name(processor_name,&namelen);

    fprintf(stdout,"Process %d of %d on %s\n",
            myid, numprocs, processor_name);

    n = 0;
    while (!done)
    {
        if (myid == 0)
        {
            /*
            printf("Enter the number of intervals: (0 quits) ");
            scanf("%d",&n);
            */
            if (n==0) n=10000; else n=0;

            startwtime = MPI_Wtime();
        }
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0)
            done = 1;
        else
        {
            h = 1.0 / (double) n;
            sum = 0.0;
            /* A slightly better approach starts from large i and works back */
            for (i = myid + 1; i <= n; i += numprocs)
            {
                x = h * ((double)i - 0.5);
                sum += f(x);
            }
            mypi = h * sum;

            MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

            if (myid == 0)
            {
                printf("pi is approximately %.16f, Error is %.16f\n",
                       pi, fabs(pi - PI25DT));
                endwtime = MPI_Wtime();
                printf("wall clock time = %f\n", endwtime-startwtime);
                fflush( stdout );
            }
        }
    }
    MPI_Finalize();
    return 0;
}

#3 · **longmtm** » Mon Apr 24, 2006 6:28 pm

Jpringle:

Thank you very much for the suggestions.

I did get MPI to work with cpi.c. The problem was solved after we rechecked the OpenPBS settings and restarted the PBS server and mom daemons.

Wen Long