I started working on the parallelization of ROMS adjoint model. I doing this by stages in two directories: src and new for current and new codes, respectively. In this way I can check and recover from any mistake that I make during the parallelization.
I have worked few routines already and I can see some savings in the tangent linear and big savings in the adjoint code. There is a lot of repetitions. In one case, I reduced the number of divisions in one loop from six to one. This is a tremendous saving in computational cycles. Also, the code looks much cleaner, compact, and easy to debug.
I also found a solution for parallelizing the shared-memory code. But it requires some few changes. For example, the following code (from ad_omega.F) violates shared-memory rules:
Code: Select all
DO k=N,1,-1
DO i=Istr,Iend
!! tl_W(i,j,k)=tl_W(i,j,k-1)-
!! & (tl_Huon(i+1,j,k)-tl_Huon(i,j,k)+
!! & tl_Hvom(i,j+1,k)-tl_Hvom(i,j,k))
!!
ad_W(i,j,k-1)=ad_W(i,j,k-1)+ad_W(i,j,k)
ad_Huon(i+1,j ,k)=ad_Huon(i+1,j ,k)-ad_W(i,j,k)
ad_Huon(i ,j ,k)=ad_Huon(i ,j ,k)+ad_W(i,j,k)
ad_Hvom(i ,j+1,k)=ad_Hvom(i ,j+1,k)-ad_W(i,j,k)
ad_Hvom(i ,j ,k)=ad_Hvom(i ,j ,k)+ad_W(i,j,k)
END DO
END DO
However, notice for example, as Dan Schaffer, pointed out that
Code: Select all
ad_Huon(3+1,j ,k)=ad_Huon(3+1,j ,k)-ad_W(3,j,k)
ad_Huon(4 ,j ,k)=ad_Huon(4 ,j ,k)+ad_W(4,j,k)
ad_Hvom(i ,5+1,k)=ad_Hvom(i ,5+1,k)-ad_W(i,5,k)
ad_Hvom(i ,6 ,k)=ad_Hvom(i ,6 ,k)+ad_W(i,6,k)
Therefore,
ad_Huon(4 ,j ,k)=ad_Huon(4 ,j ,k)+(ad_W(4,j,k)-ad_W(3,j,k))
ad_Hvom(i ,6 ,k)=ad_Hvom(i ,6 ,k)+(ad_W(i,6,k)-ad_W(i,5,k))
So the above code becomes:
DO k=N,1,-1
DO i=Istr,Iend
!! tl_W(i,j,k)=tl_W(i,j,k-1)-
!! & (tl_Huon(i+1,j,k)-tl_Huon(i,j,k)+
!! & tl_Hvom(i,j+1,k)-tl_Hvom(i,j,k))
!!
(1) ad_W(i,j,k-1)=ad_W(i,j,k-1)+ad_W(i,j,k)
(2) ad_Huon(i,j,k)=ad_Huon(i,j,k)+(ad_W(i,j,k)-ad_W(i-1,j ,k))
(3) ad_Hvom(i,j,k)=ad_Hvom(i,j,k)+(ad_W(i,j,k)-ad_W(i ,j-1,k))
END DO
END DO
Well, the story does not end here because we need ad_W(i-1,j,k) and ad_W(i,j-1,k). That is, we need values at the extended (i,j) ranges. But we still have the problem with statement (1). This is an integration.
I guess that I picked out a very hard example. I need to think more about this one. However, this illustrate the solution for the parallelization of the adjoint.
Now that I am deep into this business, I believe less in automatic adjoint compilers because it generates messy and very expensive code that it is difficult to parallelize. It is hard to teach a compiler how to do this efficiently. The compiler needs to have a memory capability to allow recognizing similar blocks of code to avoid computing the same terms over and over. Also because of ROMS design, it is very hard to generate the tangent linear and adjoint transformation with an automatic compiler.
Hernan G. Arango
arango@imcs.rutgers.edu