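I noticed that MPI-ROMS is slower than SSM-ROMS. You can combine the separate MPI_WAIT calls for the 4 neighboring directions into a single MPI_WAITALL, which lets MPI complete the requests in whatever order they finish rather than the order in which you wait on them. This should improve performance a bit.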
Also, for your timers: I would suggest taking a global max of the timers rather than a global sum. The max is easier to interpret as you change processor counts, and it is probably a slightly better measure than the average (which one could compute by dividing your sum by Nprocs), since the slowest processor sets the elapsed time.
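To illustrate the difference with made-up numbers (these are not measurements from ROMS), here is a small plain-Python sketch with no MPI involved. The `summarize` helper and the per-rank timings are hypothetical; in the model itself the max would come from a reduction such as MPI_REDUCE with MPI_MAX in place of MPI_SUM.

```python
def summarize(times):
    """Return (sum, mean, max) of per-rank elapsed times, in seconds."""
    total = sum(times)
    return total, total / len(times), max(times)


# Hypothetical timings for one fixed-size run; the last rank is the slow one.
four_ranks = [10.0, 10.0, 10.0, 12.0]
eight_ranks = [5.0] * 7 + [7.0]

for name, times in (("4 ranks", four_ranks), ("8 ranks", eight_ranks)):
    total, mean, slowest = summarize(times)
    print(f"{name}: sum={total:.2f}  mean={mean:.2f}  max={slowest:.2f}")
```

With these invented numbers the sum is 42 seconds for both rank counts, so it hides the speedup, while the max drops from 12 to 7 seconds and tracks what you would actually see on the wall clock.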
Dan Schaffer