IMPORTANT: Parallel Global Reduction and Volume Conservation

Message

arango · #1 Unread post by **arango** » Thu Mar 05, 2015 1:28 am

WARNING: You cannot replay or post comments in this forum thread. If you have any comments, use the following

post instead.

In ROMS, we have the option to impose a volume conservation in applications with open boundaries if tidal forcing is not anabled:

Code: Select all

! Set lateral open boundary edge volume conservation switch for
! nonlinear model and adjoint-based algorithms. Usually activated
! with radiation boundary conditions to enforce global mass
! conservation, except if tidal forcing is enabled. [1:Ngrids].

   VolCons(west)  ==  T                            ! western  boundary
   VolCons(east)  ==  T                            ! eastern  boundary
   VolCons(south) ==  T                            ! southern boundary
   VolCons(north) ==  F                            ! northern boundary

ad_VolCons(west)  ==  T                            ! western  boundary
ad_VolCons(east)  ==  T                            ! eastern  boundary
ad_VolCons(south) ==  T                            ! southern boundary
ad_VolCons(north) ==  F                            ! northern boundary

The routines in obc_volcons.F are used to impose volume conservation. The flux integral along the open boundary (bc_flux) and its area (bc_flux) are computed in routine obc_flux_tile. Both values are used to compute the barotropic velocity correction (normal component to open boundary), ubar_xs = bc_flux / bc_area, so the volume integral along the open boundary is conserved. They are scalar values since it represents the accumulative sum along the open boundary.

The computation of bc_flux and bc_area are simple but they are subject to round-off in parallel and serial applications with tile partitions. Both values are usually large numbers, specially in big domains and deep bathymetry. This makes the round-off problem worse.

In parallel applications, we need to have a special code to carry out the global reduction operation (summation):

Code: Select all

      IF (ANY(VolCons(:,ng))) THEN
#ifdef DISTRIBUTE
        NSUB=1                           ! distributed-memory
#else
        IF (DOMAIN(ng)%SouthWest_Corner(tile).and.                      &
     &      DOMAIN(ng)%NorthEast_Corner(tile)) THEN
          NSUB=1                         ! non-tiled application
        ELSE
          NSUB=NtileX(ng)*NtileE(ng)     ! tiled application
        END IF
#endif
!$OMP CRITICAL (OBC_VOLUME)
        IF (tile_count.eq.0) THEN
          bc_flux=0.0_r8
          bc_area=0.0_r8
        END IF
        bc_area=bc_area+my_area
        bc_flux=bc_flux+my_flux
        tile_count=tile_count+1
        IF (tile_count.eq.NSUB) THEN
          tile_count=0
#ifdef DISTRIBUTE
          buffer(1)=bc_area
          buffer(2)=bc_flux
          op_handle(1)='SUM'
          op_handle(2)='SUM'
          CALL mp_reduce (ng, iNLM, 2, buffer, op_handle)
          bc_area=buffer(1)
          bc_flux=buffer(2)
#endif
          ubar_xs=bc_flux/bc_area
        END IF
!$OMP END CRITICAL (OBC_VOLUME)
      END IF

Well, here is where the round-off problem comes from and we get different solutions for different parallel partitions. Therefore, we have a replicability problem. In general, collective parallel floating-point operations can produce different results on different tile (processor) configurations. That is, for example, configuring your application on 2x5 tiles accumulate round-off errors differently than on 4x8.

Notice that in any computer, (A + B + C + D) will not give the same results as (A + D + C + B). Therefore, when we do floating-point parallel reduce operations the order of the reduction between chunks of numbers matters. This order is not deterministic in routine like mpi_allreduce, which is called by the ROMS routine mp_reduce. As matter of fact, there are three different ways in mp_reduce to compute this reduction operation using either mpi_allreduce, mpi_allgather, and mpi_isend/mpi_irecv. All three methods are subject to round-off problems.

What To Do:

Currently, there is not much that we can do. Sasha suggested to carry the summation-by-pairs to reduce the round-off error. This summation adds only comparable numbers from the chunk of parallel data.
If you are using frequent data from larger scale models, you can suppress volume conservation to see what happens. Specially, you need frequent free-surface (zeta) data. You may get into trouble if running for long (years) periods.
Impose tidal forcing at the open boundary since we need to avoid volume conservation.

Notice that we not longer can compare NetCDF files byte by byte to check for parallel bugs when volume conservation is activated:

Code: Select all

% diff r1/ocean_avg.nc r2/ocean_avg.nc
Binary files r1/ocean_avg.nc and r2/ocean_avg.nc differ

% diff r1/ocean_his.nc r2/ocean_his.nc
Binary files r1/ocean_his.nc and r2/ocean_his.nc differ

% diff r1/ocean_rst.nc r2/ocean_rst.nc
Binary files r1/ocean_rst.nc and r2/ocean_rst.nc differ

To compute binary differences, you need to activate CPP options: DEBUGGING, OUT_DOUBLE, and POSITIVE_ZERO.