- ROMS has coarse-grained parallel design that partitions the application grid into horizontal tiles. A detailed description of this is provided in the following post.
- In serial and shared-memory (OpenMP) applications, all the state arrays are allocated for the full grid, for example we allocate the free-surface as zeta(LBi:UBi, LBj:UBj,...). In non-periodic grids, the lower and upper bounds for the horizontal dimensions are: LBi=0, UBi=Im(ng)+1, LBj=0, and LBj=Jm(ng)+1. The values of Im(ng) and Jm(ng) are computed in mod_param.F from input parameters Lm(ng) and Mm(ng), respectively, plus some padding when necessary:
Code: Select all
DO ng=1,Ngrids I_padd=(Lm(ng)+2)/2-(Lm(ng)+1)/2 J_padd=(Mm(ng)+2)/2-(Mm(ng)+1)/2 Im(ng)=Lm(ng)+I_padd Jm(ng)=Mm(ng)+J_padd ... END DO
- In distributed-memory (MPI) applications, all the state arrays are allocated for the node tile size plus the ghost points. On an interior tile, the lower and upper bounds for the horizontal dimensions are: LBi=Istr-Nghost, UBi=Iend+Nghost, LBj=Jstr-Nghost, and LBj=Jend+Nghost. Here, Istr, Iend, Jstr, Jend are the tile computation ranges in the I- and J-directions computed in routine get_tile, and Nghost is the number of ghost (halo) points (usually, Nghost=2 but sometimes Nghost=3 is required). Therefore, the state arrays are declared and allocated to a smaller size than those used in OpenMP for the same tile partition (other than 1x1). In our version of ROMS, the dimension ranges (indices) have a global mapping corresponding to the full grid. There is not a local tile remapping of the I- and J-indices. For example, if we have a non-periodic grid with Lm=512, Mm=256, NtileI=2, and NtileJ=4 (a 2x4 partition) will yield:
Then, on tile=3 (usually, CPU node=4) we will declare zeta(255:514, 63:130, ...) when Nghost=2, or zeta(254:514, 62:131, ...) when Nghost=3. Notice that since tile=3 is adjacent to the western boundary, we do not need to add ghost points at that boundary and the state arrays are dimensioned to Im(ng)+1 on those tiles. In this example, the 514 value is due to padding and not because additional 2 ghost points. The 514 value is computed from Im(ng)=Lm(ng)+I_padd with I_padd=1=(Lm(ng)+2)/2 - (Lm(ng)+1)/2, as shown in the code above. I selected these dimension values to expose several internal issues in ROMS design. By the way, the total number of horizontal physical points (no counting ghost points) is 491520 for all tiles (very close to half a million), so we can have balanced computations in all nodes (hopefully, but usually the master does additional work). This is still a large number to fit a particular application in cache.
Code: Select all
tile: 0 Itile = 0 Jtile = 0 Istr = 1 Iend = 256 Jstr = 1 Jend = 64 tile: 1 Itile = 1 Jtile = 0 Istr = 257 Iend = 512 Jstr = 1 Jend = 64 tile: 2 Itile = 0 Jtile = 1 Istr = 1 Iend = 256 Jstr = 65 Jend = 128 tile: 3 Itile = 1 Jtile = 1 Istr = 257 Iend = 512 Jstr = 65 Jend = 128 tile: 4 Itile = 0 Jtile = 2 Istr = 1 Iend = 256 Jstr = 129 Jend = 192 tile: 5 Itile = 1 Jtile = 2 Istr = 257 Iend = 512 Jstr = 129 Jend = 192 tile: 6 Itile = 0 Jtile = 3 Istr = 1 Iend = 256 Jstr = 193 Jend = 256 tile: 7 Itile = 1 Jtile = 3 Istr = 257 Iend = 512 Jstr = 193 Jend = 256
- One of the advantages of the ROMS distributed-memory design is that an application cannot be bounded by memory. As the number of grid points increases, you just need more MPI nodes to fit the application nicely on cache and avoid penalties in memory and efficiency. This is not possible in shared-memory applications because we always need the full state arrays in memory. Recall that in shared-memory, ROMS is accessing simultaneously the data in memory (global arrays) but operates in different parts of the data (tiles) to avoid race conditions (more than one thread writing the same memory location at the same time; a nasty shared-memory parallel bug with unpredictable results).
- In ROMS kernel, different decisions are made if you configure shared- or distributed-memory. Additionally, periodic boundary conditions complicate matters. It is not possible to apply periodic boundary conditions in distributed-memory when there is more than one tile partition at the periodic boundary. The node in question does not have the information from other nodes in the array assignment, you need to communicate. Recall that we have the following E-W periodic conditions in MPI applications:
Periodic boundary conditions are not applied to local private arrays because they already contain periodic information by a clever manipulation of the I- and J-indices. In distributed-memory, the periodic boundary conditions are applied during the MPI exchanges between tiles (see mp_exchange.F). This is not a problem in serial or shared-memory applications because we have the full array in memory. We just need to check for race conditions in OpenMP.
Code: Select all
IF (NtileI(ng).eq.1) THEN IF (WESTERN_EDGE) THEN DO j=Jstr,JendR A(Lm(ng)+1,j)=A(1,j) A(Lm(ng)+2,j)=A(2,j) # ifdef THREE_GHOST A(Lm(ng)+3,j)=A(3,j) # endif END DO END IF IF (EASTERN_EDGE) THEN DO j=Jstr,JendR A(-2,j)=A(Lm(ng)-2,j) A(-1,j)=A(Lm(ng)-1,j) A( 0,j)=A(Lm(ng) ,j) END DO END IF END IF
- In view of all this, having a hybrid MPI and OpenMP paradigm is very difficult, inefficient, and a waist of memory in my opinion. We always need to have a compromise between efficiency and memory requirements. Of course, different people have various opinions and biases on the subject. Several years ago, I looked into the hybrid system and read several papers on the subject. Believe or not there are tons of publications about this with much simpler models than ROMS. What I got from the literature was that MPI alone performed better than the hybrid MPI plus OpenMP system in mostly all cases with few exceptions that depended on the computer architecture and the number of parallel nodes. I don't recall seeing great performance gains in those exceptions. In the cases that acceptable hybrid performance was achieved, it was with less number of nodes (say, 16 or 32) but with similar performance to the MPI alone case. The other curious thing was the practicality of the hybrid system. The people in these publications were doing research in big super-computers and not in the computers that mostly all of us use. This was more like the work that you do to publish computer and software engineering results and to follow fancy trends. I bet that some of these people do not have an idea of complex geophysical modeling.
- Hybrid systems requires a lot of engineering and and alternate design in the computational kernels. There are very difficult in either serial and parallel I/O management. There are also very difficult to debug. I am not aware of a debugger that can do this in a transparent way. Compiling libraries require additional options and more complicated makefiles. This seems to me too much effort for so little or no gain. Sometimes, you heard people talking about this without the real knowledge of all the technical aspects and details that such a hybrid system entails. It is easy to talk or read about this, but coding it is a complete different story.
Hybrid Parallelism using both MPI and OpenMP
- arango
- Site Admin
- Posts: 1367
- Joined: Wed Feb 26, 2003 4:41 pm
- Location: DMCS, Rutgers University
- Contact:
Hybrid Parallelism using both MPI and OpenMP
The issue of hybrid parallelism in ROMS by combining both MPI and OpenMP paradigms has come up several times in this forum. So it is useful to document few facts about ROMS parallel framework. Please read this carefully if you want to follow the arguments. I apologize for the detailed information, but I need to explain this with an example :