Hello,
I have always run the ROMS model in parallel using OpenMP, but recently I switched to MPI. To my surprise, depending on the number of processors and the model decomposition, I get blowups and/or different results. With OpenMP I had been using a 4x4 scheme. If I use the same scheme with MPI, I get a blowup. 5x4 seems to work OK, but if I use more processors I get blowups again. The only scheme that worked with more than 5x4 processors is 1x35.
When I get blowups, changing the timestep or the VISC2/VISC4/TNU2/TNU4 parameters makes no difference; the model keeps blowing up.
With the different configurations there is normally no difference in the mean kinetic energy (up to the point when it blows up), as you can see in the attached figure (the blue and green lines overlap). But with some configurations using 40 processors (both 5x8 and 1x40) the mean kinetic energy increases, owing to a strong current near the southern boundary that you can see in the plot. The 1x35 configuration worked; 1x40 didn't.
The only parameters that changed between configurations were NtileI and NtileJ; everything else stayed the same (boundary conditions, forcings, timestep, theta_b, theta_s, grid, ...). I am using ROMS version 3.3. The only thing I did to use MPI was to set "USE_MPI ?= on" and unset "USE_OpenMP ?=" in the makefile, and to set the compiler "FC := mpiifort" in "Compilers/Linux-ifort.mk". With this setup ROMS compiled and ran with the 5x4 scheme, giving the same results as with OpenMP.
Am I missing something? Am I doing anything wrong? Has anybody had issues like this? Any comment would be a great help.
Many thanks,
Marcos Cobas Garcia
Centro de Supercomputacion de Galicia
Blowups depending on model decomposition with MPI
Re: Blowups depending on model decomposition with MPI
Rather than running it until it blows up, I would run for say 5-10 steps, saving a history record each timestep. Do this for a good case and a bad case, then do ncdiff on the two outputs. The correct answer is that you get all zeroes, but clearly there is a bug somewhere.
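If it helps, here is a minimal sketch of that comparison in Python with the netCDF4 package. The file names ocean_his_good.nc and ocean_his_bad.nc are just placeholders for whatever your two runs produced, and the variable list is only the usual suspects:

    import numpy as np
    from netCDF4 import Dataset

    # Placeholder file names: point these at the history files written
    # every time step by the "good" and "bad" decompositions.
    good = Dataset("ocean_his_good.nc")
    bad = Dataset("ocean_his_bad.nc")

    for name in ("zeta", "temp", "salt", "u", "v"):
        diff = np.abs(good.variables[name][:] - bad.variables[name][:])
        # A decomposition-independent run should give 0.0 for every field.
        print(name, float(diff.max()))

    good.close()
    bad.close()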
What are your values of Lm and Mm? It's possible that we've only really tested the case where Lm/NtileI and Mm/NtileJ are integers.
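To illustrate the point (a rough sketch only, using plain ceiling division rather than ROMS's actual tile bookkeeping): when a dimension is not divisible by the number of tiles, the tiles come out uneven.

    # Rough illustration: split one grid dimension into ntile pieces,
    # giving the remainder to the last tile (ceiling division).
    def tile_sizes(lm, ntile):
        chunk = -(-lm // ntile)
        return [chunk] * (ntile - 1) + [lm - chunk * (ntile - 1)]

    print(tile_sizes(100, 3))   # [34, 34, 32] -> uneven tiles
    print(tile_sizes(100, 4))   # [25, 25, 25, 25] -> even tiles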
Re: Blowups depending on model decomposition with MPI
Hello Kate and everybody,
Thanks for your fast reply, Kate.
So far I had compared the solutions visually, and they seemed equal to me. But today I took the two configurations with 35 processors (5x7 and 1x35) and compared the temperature at every layer. The differences are significant in the top layers, as you can see in the attached figures, from the very beginning. The differences in the bottom layers are much smaller until about 2 days before the run blows up (the output is saved approximately every 12 hours).
The values of Lm and Mm are Lm = 258 and Mm = 508. I picked some bad numbers: 508 is only divisible by 2, 4, 127, and 254, and 258 only by 2, 3, 6, and much bigger numbers. If I had chosen 500, or even 507, I would have more options. I will try a 6x4 decomposition, which gives integer values of Lm/NtileI and Mm/NtileJ, as you suggest.
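For reference, here is a small Python sketch (nothing ROMS-specific, just a helper assumed for illustration) that lists the (NtileI, NtileJ) pairs dividing the interior grid evenly:

    # Enumerate tilings where Lm/NtileI and Mm/NtileJ are both integers.
    Lm, Mm = 258, 508

    def divisors(n):
        return [d for d in range(1, n + 1) if n % d == 0]

    for i in divisors(Lm):
        for j in divisors(Mm):
            # Keep an arbitrary "reasonable" processor-count window for display.
            if 16 <= i * j <= 48:
                print(f"NtileI={i}  NtileJ={j}  processors={i * j}")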
Thanks a lot,
Marcos Cobas Garcia
Centro de Supercomputacion de Galicia
Re: Blowups depending on model decomposition with MPI
One approach is to #undef all the extra options like uv_visc2 and ts_mixing and tides, etc. Run the model for just a few time steps and save every time step. Then slowly turn these things back on and compare solutions. One note of caution: running the model with the mixing coefficients set to 0 may not be the same as #undef-ing that mixing and rebuilding. One would hope so, but I suggest that you undef these things and slowly add them back in.
Re: Blowups depending on model decomposition with MPI
Hello,
Thanks for the replies.
The run with the 6x4 domain decomposition didn't work either. I also noticed that there are differences between the OpenMP 4x4 and MPI 1x35 runs: the maximum difference in temperature is over 2.5 degrees. The differences in temperature, salinity, zeta, u, and v are concentrated mainly near the shelf break and share a similar pattern.
I will try as jcwarner suggests, and keep posting.
Thanks a lot,
Marcos Cobas
Centro de Supercomputacion de Galicia
arango - DMCS, Rutgers University
Re: Blowups depending on model decomposition with MPI
I don't know what the problem is here. It looks like a partition problem and a parallel bug. However, your choice of partitions is not optimal. I also think this application is not large enough to justify that many processors; you will be penalized by excessive MPI communication. There is always an optimal number of partitions per application, as I have mentioned many times in this forum. The fact that you have many processors available does not mean that you need to use all of them to run your application. I bet your application will be more efficient with fewer processors.
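As a back-of-the-envelope illustration of that communication penalty, here is a rough Python sketch comparing the halo-to-interior ratio of a few tilings of a 258x508 grid. It assumes a 2-point halo on every tile edge, which is only an approximation; real costs also depend on latency, the number of exchanges per step, and the network.

    # Rough halo-to-interior ratio: points exchanged per point computed.
    Lm, Mm, halo = 258, 508, 2

    def comm_ratio(ntile_i, ntile_j):
        tile_i, tile_j = Lm / ntile_i, Mm / ntile_j
        interior = tile_i * tile_j
        halo_pts = 2 * halo * (tile_i + tile_j)
        return halo_pts / interior

    # Smaller tiles (and long thin strips like 1x35) exchange more per
    # point computed, so more processors is not automatically faster.
    for tiling in [(4, 4), (5, 8), (1, 35), (6, 4)]:
        print(tiling, round(comm_ratio(*tiling), 3))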
My experience with distributed memory is that I get the same behavior with MPICH1, MPICH2, and OpenMPI. In some cases you can get identical solutions, depending on the compiler, compiler flags, and MPI libraries. We always use MPI libraries compiled with the same version of the compiler as ROMS. Notice that in the distributed-memory exchanges ROMS uses the lower-level MPI communication routines; nothing fancy.
As far as I know, the ROMS code, as distributed, is free of distributed-memory (MPI) parallel bugs. If a parallel bug is found in an application, it is usually associated with the user's customization of the code. The MPI paradigm is the easier one; shared memory and serial with partitions are more difficult in ROMS. I sometimes forget this when coding, since I am so used to MPI nowadays. All the adjoint-based algorithms only work with MPI.
Now, OpenMP is a protocol for shared memory. I just fixed a parallel bug today for the biharmonic stress tensor in shared-memory and serial-with-partitions runs; see the ticket. You need to update. This will be a problem for you if you are using shared memory (OpenMP).
I always recommend that users use the build.sh or build.bash script instead of modifying the makefile.
When configuring a large application, it is always a good idea to set your grid dimensions as multiples of 2 or 3. I actually use powers of 2, which is best. This allows many choices for tile partitioning and tile balancing for efficiency. You cannot select the number of grid points capriciously in the parallel computing world.
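As a concrete (hypothetical) comparison of that advice, the snippet below counts how many tile counts divide a dimension evenly for power-of-2 sizes versus the 258 and 508 actually used; more divisors means more balanced tilings to choose from.

    # Count the tile counts (up to 64) that divide a dimension evenly.
    def n_even_splits(dim, max_tiles=64):
        return sum(1 for n in range(1, max_tiles + 1) if dim % n == 0)

    # e.g. 256 -> 7, 258 -> 5, 508 -> 3, 512 -> 7
    for dim in (256, 258, 508, 512):
        print(dim, "->", n_even_splits(dim), "even splits up to 64 tiles")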
Re: Blowups depending on model decomposition with MPI
Hi Marcos
We now have exactly the same problem; our ROMS version is 511.
Did you find a solution?
Attachments: FIG1.png, FIG2.png, FIG3.png