Parallelization of the Adjoint Model

Message

arango · #1 Unread post by **arango** » Sun Dec 07, 2003 12:48 am

Hi All,

I started working on the parallelization of ROMS adjoint model. I doing this by stages in two directories: src and new for current and new codes, respectively. In this way I can check and recover from any mistake that I make during the parallelization.

I have worked few routines already and I can see some savings in the tangent linear and big savings in the adjoint code. There is a lot of repetitions. In one case, I reduced the number of divisions in one loop from six to one. This is a tremendous saving in computational cycles. Also, the code looks much cleaner, compact, and easy to debug.

I also found a solution for parallelizing the shared-memory code. But it requires some few changes. For example, the following code (from ad_omega.F) violates shared-memory rules:

Code: Select all

        DO k=N,1,-1
          DO i=Istr,Iend
!!          tl_W(i,j,k)=tl_W(i,j,k-1)-
!!   &                  (tl_Huon(i+1,j,k)-tl_Huon(i,j,k)+
!!   &                   tl_Hvom(i,j+1,k)-tl_Hvom(i,j,k))
!!
            ad_W(i,j,k-1)=ad_W(i,j,k-1)+ad_W(i,j,k)
            ad_Huon(i+1,j  ,k)=ad_Huon(i+1,j  ,k)-ad_W(i,j,k)
            ad_Huon(i  ,j  ,k)=ad_Huon(i  ,j  ,k)+ad_W(i,j,k)
            ad_Hvom(i  ,j+1,k)=ad_Hvom(i  ,j+1,k)-ad_W(i,j,k)
            ad_Hvom(i  ,j  ,k)=ad_Hvom(i  ,j  ,k)+ad_W(i,j,k)
          END DO
        END DO

The problem here are the assigments for ad_Huon(i+1,j,k) and ad_Hvom(i,j+1,k). These statments imply an assigment outside of the parallel tile ranges (Istr,Iend) and (Jstr,Jend). These values are also computed by adjacent parallel tiles. It violates shared-memory collision rule which states two tiles cannot write to the same place in memory because it can give unexpected results. Notice that this problem does not exist in MPI.

However, notice for example, as Dan Schaffer, pointed out that

Code: Select all

            ad_Huon(3+1,j  ,k)=ad_Huon(3+1,j  ,k)-ad_W(3,j,k)
            ad_Huon(4  ,j  ,k)=ad_Huon(4  ,j  ,k)+ad_W(4,j,k)

            ad_Hvom(i  ,5+1,k)=ad_Hvom(i  ,5+1,k)-ad_W(i,5,k)
            ad_Hvom(i  ,6  ,k)=ad_Hvom(i  ,6  ,k)+ad_W(i,6,k)

Therefore,

            ad_Huon(4  ,j  ,k)=ad_Huon(4  ,j  ,k)+(ad_W(4,j,k)-ad_W(3,j,k))

            ad_Hvom(i  ,6  ,k)=ad_Hvom(i  ,6  ,k)+(ad_W(i,6,k)-ad_W(i,5,k))

So the above code becomes:

        DO k=N,1,-1
          DO i=Istr,Iend
!!          tl_W(i,j,k)=tl_W(i,j,k-1)-
!!   &                  (tl_Huon(i+1,j,k)-tl_Huon(i,j,k)+
!!   &                   tl_Hvom(i,j+1,k)-tl_Hvom(i,j,k))
!!
(1)         ad_W(i,j,k-1)=ad_W(i,j,k-1)+ad_W(i,j,k)
(2)         ad_Huon(i,j,k)=ad_Huon(i,j,k)+(ad_W(i,j,k)-ad_W(i-1,j  ,k))
(3)         ad_Hvom(i,j,k)=ad_Hvom(i,j,k)+(ad_W(i,j,k)-ad_W(i  ,j-1,k))
          END DO
        END DO

This code does not violate any parallel partition rules because it only updates (i,j,k) indices. The code, lines (2) and (3), starts looking like a gradient. Pretty neat, right?

Well, the story does not end here because we need ad_W(i-1,j,k) and ad_W(i,j-1,k). That is, we need values at the extended (i,j) ranges. But we still have the problem with statement (1). This is an integration.

I guess that I picked out a very hard example. I need to think more about this one. However, this illustrate the solution for the parallelization of the adjoint.

Now that I am deep into this business, I believe less in automatic adjoint compilers because it generates messy and very expensive code that it is difficult to parallelize. It is hard to teach a compiler how to do this efficiently. The compiler needs to have a memory capability to allow recognizing similar blocks of code to avoid computing the same terms over and over. Also because of ROMS design, it is very hard to generate the tangent linear and adjoint transformation with an automatic compiler.

Hernan G. Arango
arango@imcs.rutgers.edu